Peking University R Language Tutorial (Li Dongfeng), Chapter 43: Tree-Based Methods

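43.1 A Simple Demonstration of Regression Trees

A decision tree partitions the training set hierarchically according to the values of the predictors. Each split uses a single variable, so the resulting grouping can be drawn as a binary tree. To predict the class of an observation, find the group (leaf) it falls into and predict that group's class, that is, the majority class of the training observations in it. Decision trees can be used for classification and also for regression, in which case they are called regression trees and each leaf predicts the mean response of its training observations.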
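To make "each split uses a single variable" concrete, here is a minimal sketch of how one split point on one numeric predictor could be chosen by minimising the residual sum of squares. The function name best_split and the toy data are assumptions made for illustration; this is not the internal code of the tree package.

```r
# Find the cut point on one numeric predictor that minimises the
# residual sum of squares (RSS) of the two resulting groups.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # candidate cut points: midpoints between consecutive distinct values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- vapply(cuts, function(cut) {
    left  <- y[x <  cut]
    right <- y[x >= cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  i <- which.min(rss)
  list(cut = cuts[i], rss = rss[i])
}

# Toy data (made up): y jumps when x crosses 4.5
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(5.0, 5.2, 5.1, 5.3, 6.4, 6.2, 6.5, 6.3)
sp <- best_split(x, y)
sp$cut
```

A full tree repeats this search over every predictor at every node and keeps the variable and cut point with the largest improvement.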

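Decision trees are easy to interpret and handle categorical predictors without extra difficulty, but their predictive accuracy may be worse than that of other supervised learning methods.

Improvements include bagging, random forests, and boosting. All of them combine many trees, which usually improves accuracy at the cost of interpretability.

For the Hitters data, predict log(Salary) using Years and Hits as predictors.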

library(tidyverse)
library(ISLR) # package accompanying the reference textbook

data(Hitters)
da_hit <- na.omit(Hitters); dim(da_hit)
## [1] 263  20
library(rsample)
set.seed(101)
hit_split <- initial_split(
  da_hit, prop = 0.80, strata = Salary)
hit_train <- training(hit_split)
hit_test <- testing(hit_split)

Build the full tree on the training set:

library(tree)
tr1 <- tree(
  log(Salary) ~ Years + Hits, 
  data = hit_train)

Prune it to only 3 leaf nodes:

tr1b <- prune.tree(tr1, best=3)

Display the tree:

print(tr1b)
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 208 161.20 5.936  
##   2) Years < 4.5 72  35.07 5.162 *
##   3) Years > 4.5 136  60.05 6.346  
##     6) Hits < 117.5 70  23.60 5.986 *
##     7) Hits > 117.5 66  17.75 6.728 *

Display a summary:

print(summary(tr1b))
## 
## Regression tree:
## snip.tree(tree = tr1, nodes = c(6L, 2L))
## Number of terminal nodes:  3 
## Residual mean deviance:  0.3727 = 76.41 / 205 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.2280 -0.3740 -0.0589  0.0000  0.3414  2.5010

Plot the tree:

plot(tr1b); text(tr1b, pretty=0)

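The depth of a tree is the number of steps from the root node to the farthest leaf node. For example, the tree above has depth 2: at most 2 comparisons are needed to reach a leaf and obtain the predicted response.

43.2 Regression Trees

Tree depth is a measure of complexity and a hyperparameter of the tree that needs tuning; below, complexity is controlled through the number of leaf nodes (the tree size). tidymodels offers a systematic way to tune hyperparameters and evaluate performance on a test set, see 47.3. To demonstrate the method more directly, here we call the cross-validation function ourselves to tune the hyperparameter and then compute the prediction accuracy on the test set.

Use cross-validation on the unpruned training-set tree to find the best size: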

cv1 <- cv.tree(tr1)
print(cv1)
## $size
## [1] 9 8 6 5 4 3 2 1
## 
## $dev
## [1]  78.50049  81.47727  81.43670  79.43120  79.07190  92.16026 105.14082 167.75233
## 
## $k
## [1]      -Inf  2.445601  2.639571  3.186007  4.133744  8.296626 18.711912 66.037022
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv1$size, cv1$dev, type='b')
best.size <- cv1$size[which.min(cv1$dev)[1]]
abline(v=best.size, col='gray')

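The best size is 9, but the plot shows that a tree of size 4 already performs almost as well (cross-validated deviance 79.07 versus 78.50).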
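The remark that size 4 is nearly as good can be turned into a rule: take the smallest tree whose cross-validated deviance is within some tolerance of the minimum. The sketch below uses the size and dev values copied from the print(cv1) output; the 1% tolerance and the name parsimonious_size are assumptions for illustration, not part of the tree package.

```r
# CV results copied from the print(cv1) output
size <- c(9, 8, 6, 5, 4, 3, 2, 1)
dev  <- c(78.50049, 81.47727, 81.43670, 79.43120,
          79.07190, 92.16026, 105.14082, 167.75233)

tol <- 0.01                          # assumed tolerance: within 1% of the minimum
ok  <- dev <= min(dev) * (1 + tol)   # sizes that are "nearly as good"
parsimonious_size <- min(size[ok])   # smallest such tree
parsimonious_size
```

With these numbers the rule picks 4 leaves instead of 9. The value could be passed as best= to prune.tree() in place of best.size.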

Prune the training-set tree to that size:

tr1b <- prune.tree(tr1, best=best.size)

Compute the root mean squared error (RMSE) of prediction on the test set:

pred.test <- predict(tr1b, newdata = hit_test)
test.rmse <- 
  mean( (hit_test$Salary - exp(pred.test))^2 ) |> sqrt()
test.rmse
## [1] 281.7956

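RMSE = 281.8, much worse than the results of subset regression, ridge regression (RMSE = 240.7), and the lasso.

Predicting every test-set response by the mean response of the training set gives a most basic baseline for comparison. Its RMSE is: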

worst.rmse <- 
  mean( (hit_test$Salary - mean(hit_train$Salary))^2 ) |>
  sqrt()
worst.rmse
## [1] 413.1353

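Build an unpruned tree using all the predictors (the call below fits on hit_train; pass da_hit instead to use every observation):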

tr2 <- tree(log(Salary) ~ ., data = hit_train)

Prune it to the subtree size found on the training set:

tr2b <- prune.tree(tr2, best=best.size)
plot(tr2b); text(tr2b, pretty=0)

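A tree obtained this way can be used to predict new data for the same problem.

43.3 Bagging

A decision tree can change a lot across different training/test splits, which means its predictions have high variance. Borrowing the idea of the bootstrap, we can draw many training sets at random and average the fitted models, which reduces the variance of the prediction.

Concretely, draw $B$ training sets from the one training set by sampling with replacement. Let $\hat f^{*b}(\cdot)$ be the regression function fitted to the $b$-th bootstrap training set. The final regression function is the average of these:

$$\hat f_{\text{bagging}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f^{*b}(x).$$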

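This is called bagging. Bagging is very effective at improving the prediction accuracy of classification and regression trees.

The steps of bagging are:

  • Draw B bootstrap training sets from the training set by sampling with replacement; B is typically between a few hundred and a few thousand.
  • Fit an unpruned tree to each bootstrap training set.
  • If the response is continuous, predict a test observation by the average of the predictions of all trees.
  • If the response is categorical, predict a test observation by majority vote over the classes predicted by all trees.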
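The regression case of these steps can be sketched in base R. To keep the example self-contained, the base learner is a hand-written one-split stump rather than a full unpruned tree; that substitution, the helper names fit_stump and predict_stump, and the simulated data are all assumptions made for illustration.

```r
set.seed(1)

# Base learner (assumption: a one-split stump instead of an unpruned tree)
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- vapply(cuts, function(cut) {
    sum((y[x < cut] - mean(y[x < cut]))^2) +
      sum((y[x >= cut] - mean(y[x >= cut]))^2)
  }, numeric(1))
  cut <- cuts[which.min(rss)]
  list(cut = cut, left = mean(y[x < cut]), right = mean(y[x >= cut]))
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

# Simulated training data
n <- 100
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

B <- 200                                  # number of bootstrap training sets
x_new <- seq(0.5, 9.5, by = 1)            # points at which to predict
preds <- matrix(NA_real_, nrow = B, ncol = length(x_new))
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)        # bootstrap sample, drawn with replacement
  preds[b, ] <- predict_stump(fit_stump(x[idx], y[idx]), x_new)
}
bagged <- colMeans(preds)                 # average the B predictions
bagged
```

For a categorical response the last step would replace colMeans() with a majority vote over the B predicted classes.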

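Bagging can also be used to improve other regression and classification methods.

After bagging, the model can no longer be drawn as a single tree, so interpretability suffers. However, the importance of each predictor can still be measured. In regression, compute the reduction in the residual sum of squares due to splits on each predictor, averaged over all B trees. In classification, compute the reduction in the Gini index due to splits on each predictor, averaged over all B trees.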
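As a small illustration of the classification case, the Gini index of a node and the decrease achieved by one split can be computed as follows. The helper names gini and gini_decrease are assumptions for this sketch; randomForest accumulates a quantity of this kind internally for its MeanDecreaseGini column, but this is not its code.

```r
# Gini index of a node: sum over classes of p_k * (1 - p_k)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  sum(p * (1 - p))
}

# Decrease in impurity from a split:
# parent impurity minus the size-weighted impurity of the two children
gini_decrease <- function(labels, go_left) {
  n <- length(labels)
  gini(labels) -
    (sum(go_left) / n) * gini(labels[go_left]) -
    (sum(!go_left) / n) * gini(labels[!go_left])
}

node <- c("Yes", "Yes", "No", "No")
gini(node)                                       # evenly mixed two-class node
gini_decrease(node, c(TRUE, TRUE, FALSE, FALSE)) # a split that separates the classes
```

Summing such decreases over every split that uses a given predictor, and averaging over trees, gives an impurity-based importance score for that predictor.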

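Besides a test set or cross-validation, prediction accuracy can be measured by the prediction error on out-of-bag observations. Each bootstrap training set inevitably leaves out some observations, on average about one third of them. So for each observation there are roughly B/3 trees that did not use it; averaging the predictions of those trees (or taking their majority vote in classification) predicts that observation and yields an error. The resulting estimate of mean squared error or misclassification rate is called the out-of-bag (OOB) estimate. Its advantage is that it needs little extra work.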
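The "about one third" figure can be checked directly: an observation is absent from a bootstrap sample of size n with probability (1 - 1/n)^n, which tends to exp(-1), roughly 0.37. The sketch below computes this for n = 208, the training-set size in the Hitters example, and confirms it by simulation; the variable names are illustrative.

```r
n <- 208                      # size of hit_train in the Hitters example
p_out <- (1 - 1 / n)^n        # P(a given observation is absent from one bootstrap sample)
p_out
exp(-1)                       # limiting value as n grows

set.seed(1)
B <- 2000
# For each of B bootstrap samples, record whether observation 1 was left out
left_out <- replicate(B, !(1 %in% sample(n, replace = TRUE)))
mean(left_out)                # simulated fraction, close to p_out
```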

Apply bagging to the training set:

library(randomForest)
bag1 <- randomForest(
  log(Salary) ~ ., 
  data = hit_train, 
  mtry=ncol(hit_train)-1, 
  importance=TRUE)
bag1
## 
## Call:
##  randomForest(formula = log(Salary) ~ ., data = hit_train, mtry = ncol(hit_train) -      1, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##           Mean of squared residuals: 0.1980098
##                     % Var explained: 74.44

Note that randomForest() actually implements random forests, but with mtry set to the number of predictors it performs bagging.

Predict on the test set:

pred2 <- predict(bag1, newdata = hit_test)
test.rmse2 <- 
  mean( (hit_test$Salary - exp(pred2))^2 ) |> sqrt()
test.rmse2
## [1] 202.0765

RMSE = 202.1, a large improvement over the regression tree's 281.8 and clearly better than ridge regression's 240.7.

Apply bagging to the full data set:

bag2 <- randomForest(
  log(Salary) ~ ., 
  data = da_hit, 
  mtry=ncol(da_hit)-1, 
  importance=TRUE)
bag2
## 
## Call:
##  randomForest(formula = log(Salary) ~ ., data = da_hit, mtry = ncol(da_hit) -      1, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 19
## 
##           Mean of squared residuals: 0.1937377
##                     % Var explained: 75.4

Importance values of the variables and their plot:

importance(bag2)
##              %IncMSE IncNodePurity
## AtBat     10.7883286     8.1667778
## Hits       8.4949590     8.1050931
## HmRun      3.0595593     1.9280305
## Runs       7.6675720     3.8182568
## RBI        4.5596220     5.2948207
## Walks      8.0850741     6.9407788
## Years     10.0302334     2.2203968
## CAtBat    26.2359706    77.4088339
## CHits     12.8371027    24.0757798
## CHmRun     4.4959747     4.3641893
## CRuns     14.9272144    36.1514017
## CRBI      15.6525107    11.3891366
## CWalks     6.7160244     6.5333487
## League    -0.7821402     0.2073524
## Division  -1.0121206     0.2339053
## PutOuts    0.2771301     3.7336895
## Assists   -2.5795517     1.7112880
## Errors     0.9658563     1.7447031
## NewLeague  1.2244401     0.3597582
varImpPlot(bag2)

Figure 43.1: Variable importance from bagging on the Hitters data

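The most important predictor is CAtBat, followed by CRuns, CHits, and others.

How is variable importance computed? For tree-based methods, the purer the leaf nodes (all observations in a leaf share the same label, or have equal response values), the better the model fits. So for each variable we can compute the improvement in the purity measure at every internal node where that variable is used for the split, and add these improvements up. For bagging, random forests, and boosting (such as GBM), the improvement in the loss function attributable to each variable is computed instead.

Different machine learning algorithms define variable importance differently. For example, a generalized linear model (GLM) uses the absolute values of the coefficient estimates for standardized predictors as its importance measure.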
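The GLM-style measure just mentioned can be sketched in a few lines. The example assumes the built-in mtcars data and an ordinary linear model as a stand-in for any GLM; the choice of predictors is arbitrary and only for illustration.

```r
# mtcars ships with base R; it stands in here for any regression data set
X <- scale(mtcars[, c("wt", "hp", "disp", "qsec")])   # standardise the predictors
fit <- lm(mtcars$mpg ~ X)

# Importance = absolute value of each standardised coefficient (intercept dropped)
imp <- sort(abs(coef(fit)[-1]), decreasing = TRUE)
imp
```

Because the predictors are on a common scale, the coefficient magnitudes are comparable with each other, which is what makes them usable as a rough ranking.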

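43.4 Random Forests

Random forests follow the same idea as bagging but try to make the trees being averaged less dependent on each other, to reduce the inflation of the standard error that arises when positively correlated predictions are averaged. Bootstrap training sets are still drawn by sampling with replacement, but when growing the tree on each bootstrap sample, each split considers only a randomly selected subset of the predictors rather than all of them. The size of this subset is a hyperparameter of the model.

For classification trees, the number of predictors considered at each split is usually $m \approx \sqrt{p}$. For example, with the 13 predictors of the Heart data, only 4 randomly chosen ones are considered at each split.

The idea rests on the fact that averaging positively correlated samples does not reduce variance well, whereas averaging independent samples does. If there is one dominant predictor and nothing is restricted, it will almost always be chosen for the first split, making the B trees very similar. Random forests address this by restricting the set of variables available at each split.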
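The variance argument can be made explicit under a simplifying assumption, introduced here only for illustration: suppose the $B$ tree predictions $T_1, \dots, T_B$ at a point each have variance $\sigma^2$ and every pair has the same correlation $\rho \ge 0$. Then

$$\operatorname{Var}\Big(\frac{1}{B}\sum_{b=1}^{B} T_b\Big) = \frac{1}{B^2}\Big(B\sigma^2 + B(B-1)\rho\sigma^2\Big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.$$

The second term vanishes as $B$ grows, but the first does not, so with $\rho > 0$ adding more trees cannot push the variance below $\rho\sigma^2$. Lowering the correlation between trees is therefore the only way to reduce that floor.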

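The random forest idea can also be used to improve other regression and classification methods.

Both bagging and random forests can be run with the randomForest() function of the R package randomForest. With mtry set to the number of predictors the function performs bagging; with mtry at its default it runs the random forest algorithm. In that case randomForest() considers $m \approx p/3$ predictors at each split for regression and $m \approx \sqrt{p}$ for classification.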
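The default rules can be written out as two small helper functions. The names mtry_regression and mtry_classification are made up for this sketch; the formulas follow the defaults described in the randomForest() documentation, which round down.

```r
# Default mtry rules as documented for randomForest() (helper names are illustrative)
mtry_regression     <- function(p) max(floor(p / 3), 1)
mtry_classification <- function(p) floor(sqrt(p))

mtry_regression(19)       # Hitters: 19 predictors
mtry_classification(13)   # Heart: 13 predictors
```

These give 6 and 3, matching the "No. of variables tried at each split" lines printed by the default fits in this chapter. Rounding sqrt(13), about 3.6, to the nearest integer gives the 4 quoted for the Heart data, whereas rounding down gives 3.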

Apply the random forest method to the training set:

library(randomForest)
rf1 <- randomForest(
  log(Salary) ~ ., 
  data = hit_train, 
  importance=TRUE)
rf1
## 
## Call:
##  randomForest(formula = log(Salary) ~ ., data = hit_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.1895383
##                     % Var explained: 75.54

With mtry left at its default, the random forest algorithm is run.

Predict on the test set:

pred3 <- predict(rf1, newdata = hit_test)
test.rmse3 <- 
  mean( (hit_test$Salary - exp(pred3))^2 ) |> sqrt()
test.rmse3
## [1] 199.8305

RMSE = 199.8, close to bagging (RMSE = 202.1).

Apply random forests to the full data set:

rf2 <- randomForest(
  log(Salary) ~ ., 
  data = da_hit, 
  importance=TRUE)
rf2
## 
## Call:
##  randomForest(formula = log(Salary) ~ ., data = da_hit, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 0.1799338
##                     % Var explained: 77.16

Importance values of the variables and their plot:

importance(rf2)
##              %IncMSE IncNodePurity
## AtBat     10.8759999     7.4439449
## Hits       8.1725427     7.9481573
## HmRun      4.4016043     2.5935154
## Runs       9.2818801     4.8293772
## RBI        8.3514919     6.2292463
## Walks      8.8164532     6.1787450
## Years     10.6053647     5.0062719
## CAtBat    16.9507148    41.0814114
## CHits     17.6578387    41.5968368
## CHmRun     8.1292431     7.1035557
## CRuns     13.8588073    30.0948238
## CRBI      14.2775671    19.7903282
## CWalks    10.3261013    15.7222964
## League     2.0932305     0.2700378
## Division  -0.2466121     0.3021408
## PutOuts    3.1669627     3.2670212
## Assists   -0.6733127     1.7261075
## Errors     1.5649441     1.6376596
## NewLeague  1.0967640     0.3386188
varImpPlot(rf2)

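Figure 43.2: Variable importance from random forests on the Hitters data

The most important predictors are CHits, CAtBat, CRuns, CRBI, and CWalks.

43.5 Boosting

Boosting (the variant described here is also known as gradient boosting) can likewise be applied to many regression and classification problems. The idea is to fit the response with a fairly simple model and compute the residuals, then fit another simple model with the residuals as the new response and form a weighted sum of the two regression functions. The new residuals become the response for the next fit, and so on. Updating the regression function repeatedly in this way yields a final regression function that is a weighted sum of many regression functions.

The weights are usually small, so that the fit approaches the data slowly. In statistical learning, learning slowly generally gives better results.

The boosting algorithm: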

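  • (1) On the training set, set $r_i = y_i$ for all $i$, and initialize the regression function to $\hat f(\cdot) = 0$.
  • (2) For $b = 1, 2, \dots, B$, repeat:
    • (a) With the training-set predictors as predictors and $r$ as the response, fit a simple regression tree with only $d$ splits; call it $\hat f_b$.
    • (b) Update the regression function by adding a shrunken version of the tree: $\hat f(x) \leftarrow \hat f(x) + \lambda \hat f_b(x)$.
    • (c) Update the residuals: $r_i \leftarrow r_i - \lambda \hat f_b(x_i)$.
  • (3) The boosted regression function is $\hat f(x) = \sum_{b=1}^{B} \lambda \hat f_b(x)$.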
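The algorithm above can be sketched in base R for the case d = 1 with a single predictor. The stump helpers, the simulated data, and the values of lambda and B are assumptions chosen for the demonstration; gbm's actual implementation differs in many details.

```r
set.seed(1)

# A one-split regression tree (stump), fitted by minimising the RSS
fit_stump <- function(x, y) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- vapply(cuts, function(cut) {
    sum((y[x < cut] - mean(y[x < cut]))^2) +
      sum((y[x >= cut] - mean(y[x >= cut]))^2)
  }, numeric(1))
  cut <- cuts[which.min(rss)]
  list(cut = cut, left = mean(y[x < cut]), right = mean(y[x >= cut]))
}
predict_stump <- function(s, x) ifelse(x < s$cut, s$left, s$right)

# Simulated training data
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)

lambda <- 0.1   # shrinkage (assumed value for the demo)
B <- 300        # number of trees (assumed value for the demo)

r <- y                  # step (1): residuals start at y ...
f_hat <- rep(0, n)      # ... and the regression function starts at 0
for (b in seq_len(B)) {
  s <- fit_stump(x, r)                    # (a) fit a small tree to the residuals
  step <- lambda * predict_stump(s, x)
  f_hat <- f_hat + step                   # (b) add the shrunken tree
  r <- r - step                           # (c) update the residuals
}

mse_start <- mean(y^2)             # training MSE of the initial f_hat = 0
mse_end   <- mean((y - f_hat)^2)   # training MSE after B boosting steps
c(start = mse_start, end = mse_end)
```

The training error shown here always decreases with B, which is why B has to be chosen on held-out data or by cross-validation rather than on the training set.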

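The number of regression functions in the weighted sum is the choice of B. Taking B too large can also lead to overfitting, but as long as B is not too large the problem is not serious. B can be chosen by cross-validation.

The shrinkage coefficient λ is a small positive number that controls the learning rate. Values such as 0.01 or 0.001 are common; the right choice depends on the problem. A very small λ requires a very large B.

The parameter that controls the complexity of each regression function is, for regression trees, the size of the tree, expressed by the depth d. With depth 1 each tree uses a single predictor and a single split (a stump); adding up many such trees gives an additive model in the individual variables with no interaction effects, and such additive models often work very well. With d > 1 interaction terms enter. For example, with d = 2 a tree can use two variables: reaching a leaf takes at most two comparisons involving up to two predictors. Because tree models are nonlinear, the sum of many depth-2 trees can capture nonlinear pairwise interactions between predictors.

The gbm package implements boosting. interaction.depth is the depth (complexity) of each tree, n.trees is the number of trees to add up, shrinkage is the learning rate, that is, λ in the algorithm, and n.minobsinnode is the minimum number of observations each leaf node must contain, which can be set to avoid turning a handful of training cases into a rule of their own. These are all hyperparameters and should be tuned; here they are simply held fixed for demonstration (shrinkage and n.minobsinnode are left at their defaults in the call below).

Fit on the training set: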

library(gbm)
set.seed(1)
bst1 <- gbm(
  log(Salary) ~ ., 
  data = hit_train, 
  distribution = "gaussian",  
  n.trees=5000,  
  interaction.depth=4)
summary(bst1)
##                 var    rel.inf
## CAtBat       CAtBat 23.4075576
## CRBI           CRBI  7.2138130
## CRuns         CRuns  7.1524081
## PutOuts     PutOuts  6.3402558
## CHits         CHits  5.6558782
## CHmRun       CHmRun  5.6051624
## Walks         Walks  5.1110904
## Assists     Assists  4.8197073
## Hits           Hits  4.7970012
## CWalks       CWalks  4.7150910
## AtBat         AtBat  4.3214885
## HmRun         HmRun  4.1297511
## RBI             RBI  3.9799787
## Years         Years  3.5699618
## Runs           Runs  3.5257357
## Errors       Errors  3.5019377
## Division   Division  0.8191874
## League       League  0.7703509
## NewLeague NewLeague  0.5636432

CAtBat is the most important variable.

Predict on the test set and compute the RMSE:

yhat <- predict(
  bst1, 
  newdata = hit_test)
## Using 5000 trees...
mean( (hit_test$Salary - exp(yhat))^2 ) |> sqrt()
## [1] 274.633

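RMSE = 274.6, a rather poor result; the hyperparameters need tuning.

43.6 Modeling and Predicting Heart Disease Diagnosis

The Heart data are heart disease diagnosis data. The response AHD indicates whether the patient has heart disease, and we try to predict (classify) it from the predictors.

Read the Heart data set and drop observations with missing values: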

Heart <- read_csv(
  "data/Heart.csv",
  show_col_types = FALSE) |>
  dplyr::select(-1) |>
  mutate(
    AHD = factor(AHD, levels=c("Yes", "No"))
  )
## New names:
## • `` -> `...1`
Heart <- na.omit(Heart)
glimpse(Heart)
## Rows: 297
## Columns: 14
## $ Age       <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57, 48, 54, 48, 49, 64, 58, 58, 58, 60, 50, 58, 66, 43, 40, 69, 60, 64, 59, 44, 42, 43, 57, 55, 61, 65, 40, 71, 59, 61, 58, 51, 50, 65, 53, 41, 65, 44, 44, 60, 54, 50, 41, 54, 51, 51, 46, 58, 54, 54, 60, 60, 54, 59, 46, 65, 67, 62, 65, 44, 65, 60, 51, 48, 58, 45, 53, 39, 68, 52, 44, 47, 53, 51, 66, 62, 62, 44, 63, 52, 59, 60, 52, 48, 45, 34, 57, 71, 49, 54, 59, 57, 61, 39, 61, 56, 52, 43, 62, 41, 58, 35, 63, 65, 48, …
## $ Sex       <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ ChestPain <chr> "typical", "asymptomatic", "asymptomatic", "nonanginal", "nontypical", "nontypical", "asymptomatic", "asymptomatic", "asymptomatic", "asymptomatic", "asymptomatic", "nontypical", "nonanginal", "nontypical", "nonanginal", "nonanginal", "nontypical", "asymptomatic", "nonanginal", "nontypical", "typical", "typical", "nontypical", "nonanginal", "asymptomatic", "nonanginal", "nonanginal", "typical", "asymptomatic", "asymptomatic", "typical", "asymptomatic", "nonanginal", "asymptom…
## $ RestBP    <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130, 120, 172, 150, 110, 140, 130, 130, 110, 150, 120, 132, 130, 120, 120, 150, 150, 110, 140, 117, 140, 135, 130, 140, 120, 150, 132, 150, 150, 140, 160, 150, 130, 112, 110, 150, 140, 130, 105, 120, 112, 130, 130, 124, 140, 110, 125, 125, 130, 142, 128, 135, 120, 145, 140, 150, 170, 150, 155, 125, 120, 110, 110, 160, 125, 140, 130, 150, 104, 130, 140, 180, 120, 140, 138, 138, 130, 120, 160, 130, 108, 135, 128, 110, …
## $ Chol      <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256, 263, 199, 168, 229, 239, 275, 266, 211, 283, 284, 224, 206, 219, 340, 226, 247, 167, 239, 230, 335, 234, 233, 226, 177, 276, 353, 243, 225, 199, 302, 212, 330, 230, 175, 243, 417, 197, 198, 177, 290, 219, 253, 266, 233, 172, 273, 213, 305, 177, 216, 304, 188, 282, 185, 232, 326, 231, 269, 254, 267, 248, 197, 360, 258, 308, 245, 270, 208, 264, 321, 274, 325, 235, 257, 234, 256, 302, 164, 231, 141, 252, 255, 239, …
## $ Fbs       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ RestECG   <dbl> 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, …
## $ MaxHR     <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 142, 173, 162, 174, 168, 160, 139, 171, 144, 162, 160, 173, 132, 158, 172, 114, 171, 114, 151, 160, 158, 161, 179, 178, 120, 112, 132, 137, 114, 178, 162, 157, 169, 165, 123, 128, 157, 152, 168, 140, 153, 188, 144, 109, 163, 158, 152, 125, 142, 160, 131, 170, 113, 142, 155, 165, 140, 147, 148, 163, 99, 158, 177, 151, 141, 142, 180, 111, 148, 143, 182, 150, 172, 180, 156, 160, 149, 151, 145, 146, 175, 172, 161, 142, 1…
## $ ExAng     <dbl> 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, …
## $ Oldpeak   <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0.6, 0.0, 0.5, 1.6, 1.0, 1.2, 0.2, 0.6, 1.8, 1.0, 1.8, 3.2, 2.4, 1.6, 0.0, 2.6, 1.5, 2.0, 1.8, 1.4, 0.0, 0.5, 0.4, 0.0, 2.5, 0.6, 1.2, 1.0, 1.0, 1.4, 0.4, 1.6, 0.0, 2.5, 0.6, 2.6, 0.8, 1.2, 0.0, 0.4, 0.0, 0.0, 1.4, 2.2, 0.6, 0.0, 0.5, 1.4, 1.2, 1.4, 2.2, 0.0, 1.4, 2.8, 3.0, 1.6, 3.4, 3.6, 0.8, 0.2, 1.8, 0.6, 0.0, 0.8, 2.8, 1.5, 0.2, 0.8, 3.0, 0.4, 0.0, 1.6, 0.2, 0.0, 0.0, 0.0, 0.5, 0.4, 6.2, 1.8, 0.6, 0.0, 0.0, 1.2, …
## $ Slope     <dbl> 3, 2, 2, 3, 1, 1, 3, 1, 2, 3, 2, 2, 2, 1, 1, 1, 3, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 2, 1, 3, 1, 1, 1, 1, 1, 2, 2, 1, 3, 1, 2, 3, 2, 1, 2, 2, 2, 1, 3, 2, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 3, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 1, 2, 2, 3, 2, 2, 3, 2, 1, 1, 2, 1, 1, 1, 2, 2, 3, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 2, 2, 2, 3, 2, 1, 1, 2, 1, 1, …
## $ Ca        <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 1, 1, 0, 3, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 3, 0, 1, 2, 0, 0, 0, 0, 0, 2, 2, 2, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 1, 1, 2, 1, 0, 0, 0, 1, 1, 3, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 3, 1, 2, 3, 0, 0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 1, 0, 0, 0, 1, 1, 3, 0, 2, 2, 1, 0, …
## $ Thal      <chr> "fixed", "normal", "reversable", "normal", "normal", "normal", "normal", "normal", "reversable", "reversable", "fixed", "normal", "fixed", "reversable", "reversable", "normal", "reversable", "normal", "normal", "normal", "normal", "normal", "normal", "reversable", "reversable", "normal", "normal", "normal", "normal", "reversable", "normal", "reversable", "normal", "reversable", "normal", "normal", "reversable", "fixed", "reversable", "normal", "reversable", "reversable", "nor…
## $ AHD       <fct> No, Yes, Yes, No, No, No, Yes, No, Yes, Yes, No, No, Yes, No, No, No, Yes, No, No, No, No, No, Yes, Yes, Yes, No, No, No, No, Yes, No, Yes, Yes, No, No, No, Yes, Yes, Yes, No, Yes, No, No, No, Yes, Yes, No, Yes, No, No, No, No, Yes, No, Yes, Yes, Yes, Yes, No, No, Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, Yes, No, Yes, Yes, Yes, Yes, No, Yes, No, No, Yes, No, No, No, Yes, No, No, No, No, No, No, Yes, No, No, No, Yes, Yes, Yes, No, No, No, No, No, No, Yes, No, Yes, Yes, Yes, Y…
t(summary(Heart))
##                                                                                                                       
##      Age   Min.   :29.00      1st Qu.:48.00      Median :56.00      Mean   :54.54    3rd Qu.:61.00    Max.   :77.00   
##      Sex   Min.   :0.0000     1st Qu.:0.0000     Median :1.0000     Mean   :0.6768   3rd Qu.:1.0000   Max.   :1.0000  
##  ChestPain Length:297         Class :character   Mode  :character                                                     
##     RestBP Min.   : 94.0      1st Qu.:120.0      Median :130.0      Mean   :131.7    3rd Qu.:140.0    Max.   :200.0   
##      Chol  Min.   :126.0      1st Qu.:211.0      Median :243.0      Mean   :247.4    3rd Qu.:276.0    Max.   :564.0   
##      Fbs   Min.   :0.0000     1st Qu.:0.0000     Median :0.0000     Mean   :0.1448   3rd Qu.:0.0000   Max.   :1.0000  
##    RestECG Min.   :0.0000     1st Qu.:0.0000     Median :1.0000     Mean   :0.9966   3rd Qu.:2.0000   Max.   :2.0000  
##     MaxHR  Min.   : 71.0      1st Qu.:133.0      Median :153.0      Mean   :149.6    3rd Qu.:166.0    Max.   :202.0   
##     ExAng  Min.   :0.0000     1st Qu.:0.0000     Median :0.0000     Mean   :0.3266   3rd Qu.:1.0000   Max.   :1.0000  
##    Oldpeak Min.   :0.000      1st Qu.:0.000      Median :0.800      Mean   :1.056    3rd Qu.:1.600    Max.   :6.200   
##     Slope  Min.   :1.000      1st Qu.:1.000      Median :2.000      Mean   :1.603    3rd Qu.:2.000    Max.   :3.000   
##       Ca   Min.   :0.0000     1st Qu.:0.0000     Median :0.0000     Mean   :0.6768   3rd Qu.:1.0000   Max.   :3.0000  
##     Thal   Length:297         Class :character   Mode  :character                                                     
##  AHD       Yes:137            No :160

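Data download: Heart.csv

43.6.1 Splitting into Training and Test Sets

Simply split the observations into one half for training and one half for testing: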

library(rsample)
set.seed(101)
heart_split <- initial_split(
  Heart, prop = 0.50)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)
test.y <- heart_test$AHD

43.6.2 Classification Tree

Build an unpruned classification tree on the training set:

tr1 <- tree(AHD ~ ., data = heart_train)
## Warning in tree(AHD ~ ., data = heart_train): NAs introduced by coercion
plot(tr1); text(tr1, pretty=0)

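Note on reading the tree display: when an internal node splits on a categorical predictor, observations taking one of the listed category values follow the "yes" branch, and observations taking any other value follow the "no" branch.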
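The "NAs introduced by coercion" warning from tree() is consistent with the character columns ChestPain and Thal (shown as <chr> in the glimpse() output) being coerced to numeric rather than treated as categories. One possible remedy, sketched here on a made-up toy data frame, is to convert character columns to factors before fitting. Applying this to Heart would change the fitted trees, so the outputs in this section would no longer match.

```r
# Toy stand-in for Heart (values made up); ChestPain and Thal are character columns
toy <- data.frame(
  Age = c(63, 67, 37, 41),
  ChestPain = c("typical", "asymptomatic", "nonanginal", "nontypical"),
  Thal = c("fixed", "normal", "normal", "reversable"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns alone
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)
str(toy)
```

In the tidyverse style used elsewhere in this chapter, the same conversion could be written as mutate(across(where(is.character), as.factor)).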

43.6.2.1 Pruning Appropriately

Use cross-validation to choose the number of leaves to keep, pruning by misclassification rate (1 minus accuracy):

cv1 <- cv.tree(tr1, FUN=prune.misclass)
## Warning in tree(model = m[rand != i, , drop = FALSE]): NAs introduced by coercion
## Warning in pred1.tree(tree, tree.matrix(nd)): NAs introduced by coercion
## (the same pair of warnings is repeated for each of the remaining cross-validation folds, 10 pairs in total)
cv1
## $size
## [1] 16 12  6  3  2  1
## 
## $dev
## [1] 51 50 53 47 57 75
## 
## $k
## [1]      -Inf  0.000000  1.666667  2.000000 12.000000 24.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv1$size, cv1$dev, type='b', xlab='size', ylab='dev')
best.size <- cv1$size[which.min(cv1$dev)]

The best size is 3.

Produce the pruned tree on the training set:

tr1b <- prune.misclass(tr1, best=best.size)
plot(tr1b); text(tr1b, pretty=0)

Figure 43.3: Pruned classification tree for the Heart data

43.6.2.2 Computing the Misclassification Rate on the Test Set

pred1 <- predict(tr1b, heart_test, type='class')
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
tab1 <- table(pred1, test.y); tab1
##      test.y
## pred1 Yes No
##   Yes  52 30
##   No    6 61
test.err <- (tab1[1,2]+tab1[2,1])/sum(tab1[]); test.err
## [1] 0.2416107

The misclassification rate on the test set is about 24%.
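The overall error rate hides how the errors are distributed between the two classes. The sketch below rebuilds the tab1 confusion matrix from the printed output and derives sensitivity and specificity as well, treating "Yes" as the positive class; the variable names are illustrative.

```r
# Confusion matrix copied from the tab1 output (rows: predicted, columns: actual)
tab <- matrix(c(52, 6, 30, 61), nrow = 2,
              dimnames = list(pred = c("Yes", "No"), actual = c("Yes", "No")))

err  <- 1 - sum(diag(tab)) / sum(tab)          # misclassification rate
sens <- tab["Yes", "Yes"] / sum(tab[, "Yes"])  # share of actual Yes predicted as Yes
spec <- tab["No", "No"]  / sum(tab[, "No"])    # share of actual No predicted as No
round(c(error = err, sensitivity = sens, specificity = spec), 3)
```

Here most of the errors are actual "No" cases predicted as "Yes" (30 of the 36 mistakes), so the pruned tree is considerably more sensitive than specific.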

Predicting the test set with the unpruned tree generally gives worse results than the pruned tree:

pred1a <- predict(tr1, heart_test, type='class')
## Warning in pred1.tree(object, tree.matrix(newdata)): NAs introduced by coercion
tab1a <- table(pred1a, test.y); tab1a
##       test.y
## pred1a Yes No
##    Yes  42 25
##    No   16 66
test.err1a <- (tab1a[1,2]+tab1a[2,1])/sum(tab1a[]); test.err1a
## [1] 0.2751678

43.6.2.3 Building a Pruned Classification Tree on the Full Data

tr2 <- tree(AHD ~ ., data=Heart)
## Warning in tree(AHD ~ ., data = Heart): NAs introduced by coercion
tr2b <- prune.misclass(tr2, best=best.size)
plot(tr2b); text(tr2b, pretty=0)

43.6.3 Bagging

Apply bagging to the training set:

bag1 <- randomForest(
  AHD ~ ., 
  data = heart_train, 
  mtry=13, 
  importance=TRUE)
bag1
## 
## Call:
##  randomForest(formula = AHD ~ ., data = heart_train, mtry = 13,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##         OOB estimate of  error rate: 23.65%
## Confusion matrix:
##     Yes No class.error
## Yes  61 18   0.2278481
## No   17 52   0.2463768

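Note that randomForest() actually implements random forests, but with mtry set to the number of predictors it performs bagging. The out-of-bag misclassification rate (23.65%) is rather poor.

Predict on the test set: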

pred2 <- predict(bag1, newdata = heart_test)
tab2 <- table(pred2, test.y); tab2
##      test.y
## pred2 Yes No
##   Yes  44 15
##   No   14 76
test.err2 <- (tab2[1,2]+tab2[2,1])/sum(tab2[]); test.err2
## [1] 0.1946309

The test-set misclassification rate is about 19%.

Apply bagging to the full data set:

bag1b <- randomForest(
  AHD ~ ., 
  data=Heart, 
  mtry=13, 
  importance=TRUE)
bag1b
## 
## Call:
##  randomForest(formula = AHD ~ ., data = Heart, mtry = 13, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##         OOB estimate of  error rate: 21.21%
## Confusion matrix:
##     Yes  No class.error
## Yes 100  37    0.270073
## No   26 134    0.162500

Importance values of the variables and their plot:

importance(bag1b)
##                  Yes         No MeanDecreaseAccuracy MeanDecreaseGini
## Age        3.7368005  5.9999867            7.2128941       12.1304971
## Sex        8.4082243 11.6854290           14.2202966        4.6253056
## ChestPain 19.2727462 13.7357878           23.2572070       27.9651052
## RestBP     0.1959288  4.4080821            3.5491959        9.7198756
## Chol      -4.4853304  1.6204588           -1.8363588       11.5630931
## Fbs       -0.9582635  0.5261395           -0.3205312        0.8449148
## RestECG    1.6353427  0.1595635            1.3271811        1.6408234
## MaxHR      2.0318500  8.1264705            7.6843512       13.1458248
## ExAng      5.7030645  1.7850807            5.8043107        3.5962657
## Oldpeak   14.6213004 14.0934301           20.1179974       15.5059594
## Slope      5.6872206  3.5781749            6.3034330        5.5141310
## Ca        18.5244259 25.2918527           30.3846628       22.4696501
## Thal      13.6455796 17.5096952           20.9420003       18.4300152
varImpPlot(bag1b)

The most important variables are Ca, ChestPain, and Thal.

43.6.4 Random Forest

Apply the random forest method to the training set:

rf1 <- randomForest(
  AHD ~ ., 
  data = heart_train, 
  importance=TRUE)
rf1
## 
## Call:
##  randomForest(formula = AHD ~ ., data = heart_train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 22.3%
## Confusion matrix:
##     Yes No class.error
## Yes  65 14   0.1772152
## No   19 50   0.2753623

Here mtry is left at its default, which corresponds to the random forest method.

Predict on the test set:

pred3 <- predict(rf1, newdata = heart_test)
tab3 <- table(pred3, test.y); tab3
##      test.y
## pred3 Yes No
##   Yes  47 15
##   No   11 76
test.err3 <- (tab3[1,2]+tab3[2,1])/sum(tab3[]); test.err3
## [1] 0.1744966

The test-set misclassification rate is about 17%.

Apply random forests to the full data set:

rf1b <- randomForest(
  AHD ~ ., 
  data=Heart, 
  importance=TRUE)
rf1b
## 
## Call:
##  randomForest(formula = AHD ~ ., data = Heart, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 18.86%
## Confusion matrix:
##     Yes  No class.error
## Yes 108  29   0.2116788
## No   27 133   0.1687500

Importance values of the variables and their plot:

importance(rf1b)
##                 Yes          No MeanDecreaseAccuracy MeanDecreaseGini
## Age        4.164290  6.96750571           8.73313795        12.987241
## Sex        7.281438 10.61442721          13.05548914         4.987433
## ChestPain 17.452527 11.48126517          19.04415445        18.341807
## RestBP     0.190149  1.37058415           1.24243793        10.526087
## Chol      -2.275121  1.94668809          -0.03619012        11.468039
## Fbs       -1.664923  2.79680029           0.80530655         1.553999
## RestECG    4.645072 -0.02136297           3.19137194         2.815654
## MaxHR      6.711873  8.57257441          10.49071059        17.820488
## ExAng      9.259580  4.60805172           9.61466086         6.847263
## Oldpeak   13.871594  9.53000298          16.91663034        16.497308
## Slope      8.358886  3.06829267           8.36900777         6.832079
## Ca        18.797958 21.81004086          26.74441512        18.984673
## Thal      14.410041 18.71166223          21.30209724        15.882196
varImpPlot(rf1b)

Figure 43.4: Variable importance from random forests on the Heart data

The most important variables are Ca and ChestPain.

43.7 Appendix

43.7.1 The Heart Data

knitr::kable(Heart)
Age Sex ChestPain    RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal       AHD
 63   1 typical         145  233   1       2   150     0     2.3     3  0 fixed      No
 67   1 asymptomatic    160  286   0       2   108     1     1.5     2  3 normal     Yes
 67   1 asymptomatic    120  229   0       2   129     1     2.6     2  2 reversable Yes
 37   1 nonanginal      130  250   0       0   187     0     3.5     3  0 normal     No
 41   0 nontypical      130  204   0       2   172     0     1.4     1  0 normal     No
 56   1 nontypical      120  236   0       0   178     0     0.8     1  0 normal     No
 62   0 asymptomatic    140  268   0       2   160     0     3.6     3  2 normal     Yes
 57   0 asymptomatic    120  354   0       0   163     1     0.6     1  0 normal     No
 63   1 asymptomatic    130  254   0       2   147     0     1.4     2  1 reversable Yes
 53   1 asymptomatic    140  203   1       2   155     1     3.1     3  0 reversable Yes
 57   1 asymptomatic    140  192   0       0   148     0     0.4     2  0 fixed      No
 56   0 nontypical      140  294   0       2   153     0     1.3     2  0 normal     No
 56   1 nonanginal      130  256   1       2   142     1     0.6     2  1 fixed      Yes
 44   1 nontypical      120  263   0       0   173     0     0.0     1  0 reversable No
 52   1 nonanginal      172  199   1       0   162     0     0.5     1  0 reversable No
 57   1 nonanginal      150  168   0       0   174     0     1.6     1  0 normal     No
 48   1 nontypical      110  229   0       0   168     0     1.0     3  0 reversable Yes
 54   1 asymptomatic    140  239   0       0   160     0     1.2     1  0 normal     No
 48   0 nonanginal      130  275   0       0   139     0     0.2     1  0 normal     No
 49   1 nontypical      130  266   0       0   171     0     0.6     1  0 normal     No
 64   1 typical         110  211   0       2   144     1     1.8     2  0 normal     No
 58   0 typical         150  283   1       2   162     0     1.0     1  0 normal     No
 58   1 nontypical      120  284   0       2   160     0     1.8     2  0 normal     Yes
 58   1 nonanginal      132  224   0       2   173     0     3.2     1  2 reversable Yes
 60   1 asymptomatic    130  206   0       2   132     1     2.4     2  2 reversable Yes
 50   0 nonanginal      120  219   0       0   158     0     1.6     2  0 normal     No
 58   0 nonanginal      120  340   0       0   172     0     0.0     1  0 normal     No
 66   0 typical         150  226   0       0   114     0     2.6     3  0 normal     No
 43   1 asymptomatic    150  247   0       0   171     0     1.5     1  0 normal     No
 40   1 asymptomatic    110  167   0       2   114     1     2.0     2  0 reversable Yes
 69   0 typical         140  239   0       0   151     0     1.8     1  2 normal     No
 60   1 asymptomatic    117  230   1       0   160     1     1.4     1  2 reversable Yes
 64   1 nonanginal      140  335   0       0   158     0     0.0     1  0 normal     Yes
 59   1 asymptomatic    135  234   0       0   161     0     0.5     2  0 reversable No
 44   1 nonanginal      130  233   0       0   179     1     0.4     1  0 normal     No
 42   1 asymptomatic    140  226   0       0   178     0     0.0     1  0 normal     No
 43   1 asymptomatic    120  177   0       2   120     1     2.5     2  0 reversable Yes
 57   1 asymptomatic    150  276   0       2   112     1     0.6     2  1 fixed      Yes
 55   1 asymptomatic    132  353   0       0   132     1     1.2     2  1 reversable Yes
 61   1 nonanginal      150  243   1       0   137     1     1.0     2  0 normal     No
 65   0 asymptomatic    150  225   0       2   114     0     1.0     2  3 reversable Yes
 40   1 typical         140  199   0       0   178     1     1.4     1  0 reversable No
 71   0 nontypical      160  302   0       0   162     0     0.4     1  2 normal     No
 59   1 nonanginal      150  212   1       0   157     0     1.6     1  0 normal     No
 61   0 asymptomatic    130  330   0       2   169     0     0.0     1  0 normal     Yes
 58   1 nonanginal      112  230   0       2   165     0     2.5     2  1 reversable Yes
 51   1 nonanginal      110  175   0       0   123     0     0.6     1  0 normal     No
 50   1 asymptomatic    150  243   0       2   128     0     2.6     2  0 reversable Yes
 65   0 nonanginal      140  417   1       2   157     0     0.8     1  1 normal     No
 53   1 nonanginal      130  197   1       2   152     0     1.2     3  0 normal     No
 41   0 nontypical      105  198   0       0   168     0     0.0     1  1 normal     No
 65   1 asymptomatic    120  177   0       0   140     0     0.4     1  0 reversable No
 44   1 asymptomatic    112  290   0       2   153     0     0.0     1  1 normal     Yes
 44   1 nontypical      130  219   0       2   188     0     0.0     1  0 normal     No
 60   1 asymptomatic    130  253   0       0   144     1     1.4     1  1 reversable Yes
 54   1 asymptomatic    124  266   0       2   109     1     2.2     2  1 reversable Yes
 50   1 nonanginal      140  233   0       0   163     0     0.6     2  1 reversable Yes
 41   1 asymptomatic    110  172   0       2   158     0     0.0     1  0 reversable Yes
 54   1 nonanginal      125  273   0       2   152     0     0.5     3  1 normal     No
 51   1 typical         125  213   0       2   125     1     1.4     1  1 normal     No
 51   0 asymptomatic    130  305   0       0   142     1     1.2     2  0 reversable Yes
 46   0 nonanginal      142  177   0       2   160     1     1.4     3  0 normal     No
 58   1 asymptomatic    128  216   0       2   131     1     2.2     2  3 reversable Yes
 54   0 nonanginal      135  304   1       0   170     0     0.0     1  0 normal     No
 54   1 asymptomatic    120  188   0       0   113     0     1.4     2  1 reversable Yes
 60   1 asymptomatic    145  282   0       2   142     1     2.8     2  2 reversable Yes
 60   1 nonanginal      140  185   0       2   155     0     3.0     2  0 normal     Yes
 54   1 nonanginal      150  232   0       2   165     0     1.6     1  0 reversable No
 59   1 asymptomatic    170  326   0       2   140     1     3.4     3  0 reversable Yes
 46   1 nonanginal      150  231   0       0   147     0     3.6     2  0 normal     Yes
 65   0 nonanginal      155  269   0       0   148     0     0.8     1  0 normal     No
 67   1 asymptomatic    125  254   1       0   163     0     0.2     2  2 reversable Yes
 62   1 asymptomatic    120  267   0       0    99     1     1.8     2  2 reversable Yes
 65   1 asymptomatic    110  248   0       2   158     0     0.6     1  2 fixed      Yes
 44   1 asymptomatic    110  197   0       2   177     0     0.0     1  1 normal     Yes
 65   0 nonanginal      160  360   0       2   151     0     0.8     1  0 normal     No
 60   1 asymptomatic    125  258   0       2   141     1     2.8     2  1 reversable Yes
 51   0 nonanginal      140  308   0       2   142     0     1.5     1  1 normal     No
 48   1 nontypical      130  245   0       2   180     0     0.2     2  0 normal     No
 58   1 asymptomatic    150  270   0       2   111     1     0.8     1  0 reversable Yes
 45   1 asymptomatic    104  208   0       2   148     1     3.0     2  0 normal     No
 53   0 asymptomatic    130  264   0       2   143     0     0.4     2  0 normal     No
 39   1 nonanginal      140  321   0       2   182     0     0.0     1  0 normal     No
 68   1 nonanginal      180  274   1       2   150     1     1.6     2  0 reversable Yes
 52   1 nontypical      120  325   0       0   172     0     0.2     1  0 normal     No
 44   1 nonanginal      140  235   0       2   180     0     0.0     1  0 normal     No
 47   1 nonanginal      138  257   0       2   156     0     0.0     1  0 normal     No
 53   0 asymptomatic    138  234   0       2   160     0     0.0     1  0 normal     No
 51   0 nonanginal      130  256   0       2   149     0     0.5     1  0 normal     No
 66   1 asymptomatic    120  302   0       2   151     0     0.4     2  0 normal     No
 62   0 asymptomatic    160  164   0       2   145     0     6.2     3  3 reversable Yes
 62   1 nonanginal      130  231   0       0   146     0     1.8     2  3 reversable No
 44   0 nonanginal      108  141   0       0   175     0     0.6     2  0 normal     No
 63   0 nonanginal      135  252   0       2   172     0     0.0     1  0 normal     No
 52   1 asymptomatic    128  255   0       0   161     1     0.0     1  1 reversable Yes
 59   1 asymptomatic    110  239   0       2   142     1     1.2     2  1 reversable Yes
 60   0 asymptomatic    150  258   0       2   157     0     2.6     2  2 reversable Yes
 52   1 nontypical      134  201   0       0   158     0     0.8     1  1 normal     No
 48   1 asymptomatic    122  222   0       2   186     0     0.0     1  0 normal     No
 45   1 asymptomatic    115  260   0       2   185     0     0.0     1  0 normal     No
 34   1 typical         118  182   0       2   174     0     0.0     1  0 normal     No
 57   0 asymptomatic    128  303   0       2   159     0     0.0     1  1 normal     No
 71   0 nonanginal      110  265   1       2   130     0     0.0     1  1 normal     No
 49   1 nonanginal      120  188   0       0   139     0     2.0     2  3 reversable Yes
 54   1 nontypical      108  309   0       0   156     0     0.0     1  0 reversable No
 59   1 asymptomatic    140  177   0       0   162     1     0.0     1  1 reversable Yes
 57   1 nonanginal      128  229   0       2   150     0     0.4     2  1 reversable Yes
 61   1 asymptomatic    120  260   0       0   140     1     3.6     2  1 reversable Yes
 39   1 asymptomatic    118  219   0       0   140     0     1.2     2  0 reversable Yes
 61   0 asymptomatic    145  307   0       2   146     1     1.0     2  0 reversable Yes
 56   1 asymptomatic    125  249   1       2   144     1     1.2     2  1 normal     Yes
 52   1 typical         118  186   0       2   190     0     0.0     2  0 fixed      No
 43   0 asymptomatic    132  341   1       2   136     1     3.0     2  0 reversable Yes
 62   0 nonanginal      130  263   0       0    97     0     1.2     2  1 reversable Yes
 41   1 nontypical      135  203   0       0   132     0     0.0     2  0 fixed      No
 58   1 nonanginal      140  211   1       2   165     0     0.0     1  0 normal     No
 35   0 asymptomatic    138  183   0       0   182     0     1.4     1  0 normal     No
 63   1 asymptomatic    130  330   1       2   132     1     1.8     1  3 reversable Yes
 65   1 asymptomatic    135  254   0       2   127     0     2.8     2  1 reversable Yes
 48   1 asymptomatic    130  256   1       2   150     1     0.0     1  2 reversable Yes
 63   0 asymptomatic    150  407   0       2   154     0     4.0     2  3 reversable Yes
 51   1 nonanginal      100  222   0       0   143     1     1.2     2  0 normal     No
 55   1 asymptomatic    140  217   0       0   111     1     5.6     3  0 reversable Yes
 65   1 typical         138  282   1       2   174     0     1.4     2  1 normal     Yes
 45   0 nontypical      130  234   0       2   175     0     0.6     2  0 normal     No
 56   0 asymptomatic    200  288   1       2   133     1     4.0     3  2 reversable Yes
 54   1 asymptomatic    110  239   0       0   126     1     2.8     2  1 reversable Yes
 44   1 nontypical      120  220   0       0   170     0     0.0     1  0 normal     No
 62   0 asymptomatic    124  209   0       0   163     0     0.0     1  0 normal     No
 54   1 nonanginal      120  258   0       2   147     0     0.4     2  0 reversable No
 51   1 nonanginal       94  227   0       0   154     1     0.0     1  1 reversable No
 29   1 nontypical      130  204   0       2   202     0     0.0     1  0 normal     No
511asymptomatic1402610218610.010normalNo
430nonanginal1222130016500.220normalNo
550nontypical1352500216101.420normalNo
701asymptomatic1451740012512.630reversableYes
621nontypical1202810210301.421reversableYes
351asymptomatic1201980013011.620reversableYes
511nonanginal1252451216602.420normalNo
591nontypical1402210016410.010normalNo
591typical1702880215900.220reversableYes
521nontypical1282051018400.010normalNo
641nonanginal1253090013111.820reversableYes
581nonanginal1052400215410.620reversableNo
471nonanginal1082430015200.010normalYes
571asymptomatic1652891212401.023reversableYes
411nonanginal1122500017900.010normalNo
451nontypical1283080217000.010normalNo
600nonanginal1023180016000.011normalNo
521typical1522981017801.220reversableNo
420asymptomatic1022650212200.620normalNo
670nonanginal1155640216001.620reversableNo
551asymptomatic1602890214510.821reversableYes
641asymptomatic120246029612.231normalYes
701asymptomatic1303220210902.423normalYes
511asymptomatic1402990017311.610reversableYes
581asymptomatic1253000217100.012reversableYes
601asymptomatic1402930217001.222reversableYes
681nonanginal1182770015101.011reversableNo
461nontypical1011971015600.010reversableNo
771asymptomatic1253040216210.013normalYes
540nonanginal1102140015801.620normalNo
580asymptomatic1002480212201.020normalNo
481nonanginal1242551017500.012normalNo
571asymptomatic1322070016810.010reversableNo
540nontypical1322881215910.011normalNo
351asymptomatic1262820215610.010reversableYes
450nontypical1121600013800.020normalNo
701nonanginal1602690011212.921reversableYes
531asymptomatic1422260211110.010reversableNo
590asymptomatic1742490014310.020normalYes
620asymptomatic1403940215701.220normalNo
641asymptomatic1452120213202.022fixedYes
571asymptomatic152274008811.221reversableYes
521asymptomatic1082331014700.113reversableNo
561asymptomatic1321840210512.121fixedYes
431nonanginal1303150016201.911normalNo
531nonanginal1302461217300.013normalNo
481asymptomatic1242740216600.520reversableYes
560asymptomatic1344090215011.922reversableYes
421typical1482440217800.812normalNo
591typical1782700214504.230reversableNo
600asymptomatic1583050216100.010normalYes
630nontypical1401950017900.012normalNo
421nonanginal1202401019400.830reversableNo
661nontypical1602460012010.023fixedYes
541nontypical1922830219500.011reversableYes
691nonanginal1402540214602.023reversableYes
501nonanginal1291960016300.010normalNo
511asymptomatic1402980012214.223reversableYes
620asymptomatic1382941010601.923normalYes
680nonanginal1202110211501.520normalNo
671asymptomatic1002990212510.922normalYes
691typical1602341213100.121normalNo
450asymptomatic1382360215210.220normalNo
500nontypical1202440016201.110normalNo
591typical1602730212500.010normalYes
500asymptomatic1102540215900.010normalNo
640asymptomatic1803250015410.010normalNo
571nonanginal1501261017300.211reversableNo
640nonanginal1403130013300.210reversableNo
431asymptomatic1102110016100.010reversableNo
451asymptomatic1423090214710.023reversableYes
581asymptomatic1282590213013.022reversableYes
501asymptomatic1442000212610.920reversableYes
551nontypical1302620015500.010normalNo
620asymptomatic1502440015411.420normalYes
370nonanginal1202150017000.010normalNo
381typical1202310018213.820reversableYes
411nonanginal1302140216802.020normalNo
660asymptomatic1782281016511.022reversableYes
521asymptomatic1122300016000.011normalYes
561typical1201930216201.920reversableNo
460nontypical1052040017200.010normalNo
460asymptomatic1382430215210.020normalNo
640asymptomatic1303030012202.022normalNo
591asymptomatic1382710218200.010normalNo
410nonanginal1122680217210.010normalNo
540nonanginal1082670216700.010normalNo
390nonanginal941990017900.010normalNo
531asymptomatic123282009512.022reversableYes
630asymptomatic1082690016911.822normalYes
340nontypical1182100019200.710normalNo
471asymptomatic1122040014300.110normalNo
670nonanginal1522770017200.011normalNo
541asymptomatic1102060210810.021normalYes
661asymptomatic1122120213210.111normalYes
520nonanginal1361960216900.120normalNo
550asymptomatic1803270111713.420normalYes
491nonanginal1181490212600.813normalYes
740nontypical1202690212110.211normalNo
540nonanginal1602010016300.011normalNo
541asymptomatic1222860211613.222normalYes
561asymptomatic1302831210311.630reversableYes
461asymptomatic1202490214400.810reversableYes
490nontypical1342710016200.020normalNo
421nontypical1202950016200.010normalNo
411nontypical1102350015300.010normalNo
410nontypical1263060016300.010normalNo
490asymptomatic1302690016300.010normalNo
611typical1342340014502.622normalYes
600nonanginal120178109600.010normalNo
671asymptomatic120237007101.020normalYes
581asymptomatic1002340015600.111reversableYes
471asymptomatic1102750211811.021normalYes
521asymptomatic1252120016801.012reversableYes
621nontypical1282081214000.010normalNo
571asymptomatic1102010012611.520fixedNo
581asymptomatic1462180010502.021reversableYes
641asymptomatic1282630010510.221reversableNo
510nonanginal1202950215700.610normalNo
431asymptomatic1153030018101.220normalNo
420nonanginal1202090017300.020normalNo
670asymptomatic1062230014200.312normalNo
760nonanginal1401970111601.120normalNo
701nontypical1562450214300.010normalNo
571nontypical1242610014100.310reversableYes
440nonanginal1182420014900.321normalNo
580nontypical1363191215200.012normalYes
600typical1502400017100.910normalNo
441nonanginal1202260016900.010normalNo
611asymptomatic1381660212513.621normalYes
421asymptomatic1363150012511.820fixedYes
591nonanginal1262181013402.221fixedYes
401asymptomatic1522230018100.010reversableYes
421nonanginal1301800015000.010normalNo
611asymptomatic1402070213811.911reversableYes
661asymptomatic1602280213802.310fixedNo
461asymptomatic1403110012011.822reversableYes
710asymptomatic1121490012501.620normalNo
591typical1342040016200.812normalYes
641typical1702270215500.620reversableNo
660nonanginal1462780215200.021normalNo
390nonanginal1382200015200.020normalNo
571nontypical1542320216400.011normalYes
580asymptomatic1301970013100.620normalNo
571asymptomatic1103350014313.021reversableYes
471nonanginal1302530017900.010normalNo
550asymptomatic1282050113012.021reversableYes
351nontypical1221920017400.010normalNo
611asymptomatic1482030016100.011reversableYes
581asymptomatic1143180114004.433fixedYes
580asymptomatic1702251214612.822fixedYes
561nontypical1302210216300.010reversableNo
561nontypical1202400016900.030normalNo
671nonanginal1522120215000.820reversableYes
550nontypical1323420016601.210normalNo
441asymptomatic1201690014412.830fixedYes
631asymptomatic1401870214414.012reversableYes
630asymptomatic1241970013610.020normalYes
411nontypical1201570018200.010normalNo
591asymptomatic164176129001.022fixedYes
570asymptomatic1402410012310.220reversableYes
451typical1102640013201.220reversableYes
681asymptomatic1441931014103.422reversableYes
571asymptomatic1301310011511.221reversableYes
570nontypical1302360217400.0

Copyright 韭菜热线, all rights reserved. Posted by 风生水起. Please credit the source when reposting: https://www.9crx.com/80025.html
