• R language
  • Recommended: "A Handbook of Statistical Analyses Using R"

Teacher areg: in post #37, meandiffs <- mean(sy[feet]) - mean(sy[metre]) should have meandiffs[ i ] on the left-hand side.
Since I rarely use statistical inference myself, the statistical meaning of Chapter 3 is not very clear to me. Honestly I do not really need to study it, because what I actually care about is time series and panel data. Still, my sincere thanks to teacher areg: under your guidance I have finally started learning R in a real sense. Keep up the good work; I am your fan forever, haha.
[quote]Quoting post #40, by denver, 2006-11-21 20:41:

Teacher areg: in post #37, meandiffs <- mean(sy[feet]) - mean(sy[metre]) should have meandiffs[ i ] on the left-hand side.[/quote]



Thanks for pointing that out, so I can go back and correct it promptly.
[quote]Quoting post #41, by denver, 2006-11-21 21:14:

Since I rarely use statistical inference myself, the statistical meaning of Chapter 3 is not very clear to me. Honestly I do not really need to study it, because what I actually care about is time series and panel data. Still, my sincere thanks to teacher areg: under your guidance I have finally started learning R in a real sense. Keep up the good work; I am your fan forever, haha.[/quote]



We are all learning together. You can study the later chapters first; as long as you settle down to it you will get through them quickly, and after a round of review it probably will not take long before you can skilfully find the packages that are useful for your own data analysis and use them well.



These past two days I have not been working on the handbook notes; I have been looking for an easier way to get R beginners started.
[quote]Quoting post #43, by areg, 2006-11-21 22:22:





We are all learning together. You can study the later chapters first; as long as you settle down to it you will get through them quickly, and after a round of review it probably will not take long before you can skilfully find the packages that are useful for your own data analysis and use them well.



These past two days I have not been working on the handbook notes; I have been looking for an easier way to get R beginners started. [/quote]



Looking forward to it…
#####################################################

#

# CHAPTER 4

# Analysis of Variance: Weight Gain, Foster Feeding in Rats, Water Hardness and Male Egyptian Skulls

#

#####################################################

4.1 Introduction # no R code in the handbook for this section

4.2 Analysis of Variance # no R code in the handbook for this section



#####################################################

#

# CHAPTER 4

# 4.3 Analysis Using R

# 4.3.1 Weight Gain in Rats

#

#####################################################



## When analysing your own data, or any data you are given, do not rush straight to the analysis you ultimately want. Start with basic descriptive statistics so you can choose an appropriate method from the shape of the distribution; if the data are not normal, pick a suitable transformation so that the transformed data give a correct result. The chosen method is only a tool, and the analysis is a process of using that tool and sharpening your thinking; if both the method and the process are inappropriate, the results are meaningless and you might as well not analyse at all.
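
## A generic sketch added by me, not from the handbook, of the kind of check meant above; x stands in for any numeric variable from your own data:
x <- rlnorm(100, meanlog = 2, sdlog = 0.5)   # stand-in for a right-skewed variable
hist(x)                                      # eyeball the shape of the distribution
qqnorm(x); qqline(x)                         # compare against a normal distribution
shapiro.test(x)                              # a formal normality test
logx <- log(x)                               # a common transformation for right-skewed data
qqnorm(logx); qqline(logx)                   # the transformed values sit much closer to the line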



## In these exercises we are not simply copying someone else's steps on someone else's data. Before each step, get to know the data you are about to use; make frequent use of edit(), which first opens the data editor and, once you close it, prints the data as text in the console window.



## First collect the required packages and data into a script, then load it:

> source("D:\\Rstudy\\Tiro4.R")

> ls()

[1] "foster"   "skulls"   "water"     "weightgain"

>

> edit(weightgain)

  source type weightgain

1   Beef Low       90

……………………………………# omitted by areg

10   Beef Low       78

11   Beef High       73

……………………………………# omitted by areg

20   Beef High     111

21 Cereal Low     107

……………………………………# omitted by areg

30 Cereal Low       58

31 Cereal High       98

……………………………………# omitted by areg

40 Cereal High       92
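
## Besides edit(), a few functions I find handy for a first look (my addition, not in the handbook):
str(weightgain)      # variable types and a short preview of each column
head(weightgain)     # the first six rows
summary(weightgain)  # per-variable summaries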



> # The data set has two categorical variables and one numeric variable; first look at its means and standard deviations

> tapply(weightgain$weightgain, list(weightgain$source,weightgain$type), mean)

    High Low

Beef   100.0 79.2

Cereal 85.9 83.9

>

> tapply(weightgain$weightgain, list(weightgain$source,weightgain$type), sd)

      High     Low

Beef   15.13642 13.88684

Cereal 15.02184 15.70881

>

## These two expressions not only summarise the data but also show how the function is used: the first argument is the target variable, followed by a list of the factor (grouping) variables, so in effect it splits the original data and computes the statistic within each cell. The table gives a direct impression: the high amount of the beef source raises weight gain the most, the cereal source makes little difference, and the low-amount groups are the more variable (judged by the coefficient of variation). We now take this data set further with an analysis of variance. It is a two-factor ANOVA: the first factor is the source of the diet, the second the amount given.
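
## Before moving on to the ANOVA, note that aggregate() returns the same cell means in data-frame form (an aside of mine, not in the handbook):
aggregate(weightgain ~ source + type, data = weightgain, FUN = mean)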



> wg_aov <- aov(weightgain ~ source * type, data = weightgain)

> wg_aov

Call:

  aov(formula = weightgain ~ source * type, data = weightgain)



Terms:

          source   type source:type Residuals

Sum of Squares   220.9 1299.6     883.6   8049.4

Deg. of Freedom     1     1       1     36



Residual standard error: 14.95307

Estimated effects may be unbalanced # a routine note printed by aov about the parameter estimates; it is not by itself a statement that the effects differ

>



> coef(wg_aov)

      (Intercept)       sourceCereal         typeLow sourceCereal:typeLow

          100.0           -14.1           -20.8           18.8

> plot.design(weightgain)



## The Greek letters did not survive copy-and-paste, so I spell them out in words below.

## Note that this model constrains gamma1 = 0 (corresponding to Beef) and beta1 = 0 (corresponding to High), because the default treatment contrasts are used. Looking at the coefficients, Cereal has a negative effect on weight gain and so does the Low amount, while the Cereal:Low interaction is positive. Together with the plot, this says that as long as the cereal diet is above the lowest level, the amount matters relatively little; across source types the source is the decisive factor for weight gain: at the High amount the rats gain weight markedly, while the Low amount clearly cannot meet their growth needs.
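
## A small check I added (not in the handbook) to connect the coefficients with the tapply() table of cell means above; under treatment contrasts the four cell means are rebuilt as follows:
b <- coef(wg_aov)
c(Beef.High   = unname(b["(Intercept)"]),
  Cereal.High = unname(b["(Intercept)"] + b["sourceCereal"]),
  Beef.Low    = unname(b["(Intercept)"] + b["typeLow"]),
  Cereal.Low  = unname(sum(b)))            # 100.0, 85.9, 79.2 and 83.9, as in the table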



> options("contrasts")

$contrasts

    unordered       ordered

"contr.treatment"     "contr.poly"



> ## So the source coefficient -14.1 can be read as gamma2 minus gamma1. In the same spirit, we can instead constrain the gammas to sum to zero:

> coef(aov(weightgain ~ source + type + source:type,data = weightgain, contrasts = list(source = contr.sum)))

  (Intercept)       source1       typeLow source1:typeLow

      92.95         7.05       -11.40       -9.40

>

## Under this coding the first source type (Beef) has a positive effect (7.05), the Low amount reduces weight gain overall (-11.40), and the source1:Low interaction coefficient is -9.40.



> summary(wg_aov)

        Df Sum Sq Mean Sq F value Pr(>F)

source     1 220.9   220.9 0.9879 0.32688

type       1 1299.6 1299.6 5.8123 0.02114 *

source:type 1 883.6   883.6 3.9518 0.05447 .

Residuals   36 8049.4   223.6            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>

## Reading the significance codes: the amount (type) reaches significance, Pr = 0.02114 (at the 0.05 level), the source by itself does not (Pr = 0.32688), and the source:type interaction, Pr = 0.05447, falls just short of the significance threshold.



> interaction.plot(weightgain$type, weightgain$source,weightgain$weightgain)

## The plot is very intuitive; explore it graphically and then go back and make sense of each quantity in the tables above.
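
## Another aside of mine (not in the handbook): model.tables() prints the estimated means that the interaction plot visualises.
model.tables(wg_aov, type = "means")    # grand mean, marginal means and cell means
model.tables(wg_aov, type = "effects")  # the same information expressed as effects relative to the grand mean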

#####################################################################

> summary(aov(weightgain ~ type * source, data = weightgain))

        Df Sum Sq Mean Sq F value Pr(>F)

type       1 1299.6 1299.6 5.8123 0.02114 *

source     1 220.9   220.9 0.9879 0.32688

type:source 1 883.6   883.6 3.9518 0.05447 .

Residuals   36 8049.4   223.6            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>

#######################################################################

## This is the last step of this part of the analysis. Working through it step by step carries a lesson: the information in a data set has to be dug out layer by layer. In real life many people do not understand the principles behind the analysis, or even have someone else run it for them; they are satisfied with the final table, paste it into the paper without really understanding their own experiment, and call it done. Solve problems that way and the experiment might as well not have been run.





Teacher areg: which data sets does Chapter 4 need? HSAUR does not seem to contain a Weightgain data set, and I do not know what your Tiro4.R contains. Could you spell it out?

Also: how can I find out where the data used in each chapter of the handbook is stored? Thanks.
[quote]Quoting post #46, by denver, 2006-11-22 18:34:

Teacher areg: which data sets does Chapter 4 need? HSAUR does not seem to contain a Weightgain data set, and I do not know what your Tiro4.R contains. Could you spell it out?

Also: how can I find out where the data used in each chapter of the handbook is stored? Thanks.[/quote]



I have a lot going on and may have to shut R down at any moment, so I usually put the packages and data I need into a script. The name is arbitrary; for this handbook I call it Tiro.

As for how many data sets HSAUR actually contains, and which function lists them, I had not found out. My first, clumsiest approach was to page through the chapter I was studying, find every data(…, package="HSAUR") call, and copy them all into the script after the library() calls. Paging through every chapter like that is of course tedious.
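
For what it is worth, R can list a package's data sets directly, which would have saved the paging through chapters (my later addition, not from the handbook):

data(package = "HSAUR")                           # every data set shipped with HSAUR, with titles
data(package = .packages(all.available = TRUE))   # the same across all installed packages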

When revising the exercises for a chapter I may collect all the commands into one text file or script, and then having to hunt around for the data sets to load is even more work. The Windows package we use is a pre-compiled binary, so the original data files are not obvious; I therefore fetched the source package and unpacked it. It is a Unix-style package and I am not familiar with it, but its data folder lists the individual data sets. I have not yet learned how to open them, but I can copy the names out one by one into the script, producing Tiro.R, whose contents are as follows:

######################################################

library("lattice")

library("MASS")

library("scatterplot3d")

library("HSAUR")

data(aspirin, BtheB, Lanza, phosphate, polyps, roomwidth, skulls, toothpaste, waves, BCG, clouds, foster, mastectomy, pistonrings, pottery, schizophrenia2, smoking, voting, weightgain, birthdeathrates, CYGOB1, gardenflowers, meteo, planets, rearrests, schizophrenia, students, water, womensrole, bladdercancer, epilepsy, heptathlon, orallesions, plasma, respiratory, schooldays, suicides, watervoles, package = "HSAUR")

#####################################################



## Note that Forbes2000 itself is not actually faulty. The earlier error was probably a restriction on my machine: my computer cannot display those Western European characters with tails and accents. I looked at Forbes2000 and rows 32, 54 and 883 simply contain Western European characters, not garbage.



## Among the HSAUR data sets only Forbes2000 is of any size, so to save memory I leave it out of the list above. The other practice data sets are tiny and take hardly any memory, so for convenience I load them all.

> source("D:\\Rstudy\\Tiro.R")

> ls()

[1] "aspirin"       "BCG"         "birthdeathrates" "bladdercancer"

[5] "BtheB"       "clouds"       "CYGOB1"       "epilepsy"    

[9] "foster"       "gardenflowers"   "heptathlon"     "Lanza"      

[13] "mastectomy"     "meteo"       "orallesions"   "phosphate"    

[17] "pistonrings"   "planets"       "plasma"       "polyps"      

[21] "pottery"       "rearrests"     "respiratory"   "roomwidth"    

[25] "schizophrenia"   "schizophrenia2" "schooldays"     "skulls"      

[29] "smoking"       "students"     "suicides"     "toothpaste"  

[33] "voting"       "water"       "watervoles"     "waves"      

[37] "weightgain"     "womensrole"  

>



## A clumsy beginner's approach, but it gets you going; as you improve, this will probably sort itself out!



[quote]Quoting post #48, by denver, 2006-11-22 22:28:

Many thanks for the pointers, teacher.[/quote]



In practice you can improve on many of these methods yourself. Do not learn only what the book says; adapt things a little, and you will more easily spot the weak points in your own learning and make faster real progress.

When I started learning R, all I could do was type the simplest arithmetic at the prompt; everything else was a mystery, quite dizzying…

Now at least I know how to go about learning it.

Since there is often nobody around us who knows R, self-study skills matter a lot, and so does exchanging notes; whatever you do, do not just bash away blindly!
#####################################################

#

# CHAPTER 4

# 4.3 Analysis Using R

# 4.3.2 Foster Feeding of Rats of Different Genotype

#

#####################################################



> edit(foster)

  litgen motgen weight

1     A     A   61.5

2     A     A   68.2

3     A     A   64.0

……………………………………

60     J     J   42.0

61     J     J   54.0

>

## Once we have seen what the foster data set looks like, again start with the simplest statistics to get a feel for it; these extra steps were added by areg.



> tapply(foster$weight, foster$litgen, mean)

    A     B     I     J

55.11176 54.66667 52.90714 52.97333



> tapply(foster$weight, foster$motgen, mean)

    A     B     I     J

55.4000 58.7000 53.3625 48.6800



> tapply(foster$weight, foster$litgen, sd)

    A       B       I       J

8.634370 7.133689 11.273349 5.870808



> tapply(foster$weight, foster$motgen, sd)

    A     B     I     J

9.889186 7.242078 6.452790 6.297301

>



## The four expressions above give the litter weights by the genotype of the litter and by the genotype of the foster mother. With this in hand, Figure 4.4 in the handbook is easy to read. Next we look at the combinations of litter and mother genotype, since mothers of one genotype may foster litters of different genotypes and litters of one genotype may come from mothers of different genotypes. Among litter genotypes, A is the heaviest; among mother genotypes, litters reared by B mothers are the heaviest. Try describing the variability yourselves.



> tapply(foster$weight, list(foster$litgen, foster$motgen), mean)

    A     B     I     J

A 63.680 52.40000 54.12500 48.96000

B 52.325 60.64000 53.92500 45.90000

I 47.100 64.36667 51.60000 49.43333

J 54.350 56.10000 54.53333 49.06000

## Breaking the means down by genotype combination, litters of genotype I and B reared by B mothers, and litters of genotype A reared by A mothers, are the heaviest. So for continued breeding, genotype B would be the first choice, followed by A, preferably keeping the litter genotype the same as the mother's. The standard deviations below point the same way.



> tapply(foster$weight, list(foster$litgen, foster$motgen), sd)

      A     B     I     J

A 3.273683 9.374433 5.321889 8.760594

B 5.533158 5.647389 5.114277 7.636753

I 18.103315 7.124839 8.624964 5.372461

J 5.325098 3.351119 8.376953 5.335541

>

## I skip the discussion of the standard deviations; back to the handbook.



> summary(aov(weight ~ litgen * motgen, data = foster))

        Df Sum Sq Mean Sq F value   Pr(>F)  

litgen       3   60.16   20.05 0.3697 0.775221  

motgen       3 775.08 258.36 4.7632 0.005736 **

litgen:motgen 9 824.07   91.56 1.6881 0.120053  

Residuals   45 2440.82   54.24            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>



> summary(aov(weight ~ motgen * litgen, data = foster))

        Df Sum Sq Mean Sq F value   Pr(>F)  

motgen       3 771.61 257.20 4.7419 0.005869 **

litgen       3   63.63   21.21 0.3911 0.760004  

motgen:litgen 9 824.07   91.56 1.6881 0.120053  

Residuals   45 2440.82   54.24            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>



> plot.design(foster)



## From the two two-factor analyses of variance above, the mother's genotype has the largest influence on litter weight, and the effect is highly significant. Exploring it graphically confirms this. Both model orderings are usable, and the first one is the more appropriate.



> foster_aov <- aov(weight ~ litgen * motgen, data = foster)

> foster_aov

Call:

  aov(formula = weight ~ litgen * motgen, data = foster)



Terms:

            litgen   motgen litgen:motgen Residuals

Sum of Squares   60.1573 775.0806     824.0725 2440.8165

Deg. of Freedom       3       3         9     45



Residual standard error: 7.364806

Estimated effects may be unbalanced

>



## A multiple comparison of the mother genotypes' effects on litter weight further supports the inference above that genotype B is the best choice for continued breeding, as follows:



> foster_hsd <- TukeyHSD(foster_aov, "motgen")

> foster_hsd

Tukey multiple comparisons of means

  95% family-wise confidence level



Fit: aov(formula = weight ~ litgen * motgen, data = foster)



$motgen

      diff     lwr     upr   p adj

B-A 3.330369 -3.859729 10.5204672 0.6078581

I-A -1.895574 -8.841869 5.0507207 0.8853702

J-A -6.566168 -13.627285 0.4949498 0.0767540

I-B -5.225943 -12.416041 1.9641552 0.2266493

J-B -9.896537 -17.197624 -2.5954489 0.0040509

J-I -4.670593 -11.731711 2.3905240 0.3035490



>

> plot(foster_hsd)

>

## This part may feel hard if you lack the subject-matter background, but statistically it is not difficult; the task is just to pick out the most important factor. The real effort is in learning to read the result tables, where using the software meets the statistical theory; that does take some work.
#####################################################

#

# CHAPTER 4

# 4.3 Analysis Using R

# 4.3.3 Water Hardness and Mortality

#

#####################################################



> source("D:\\Rstudy\\Tiro4.R")

> ls()

[1] "foster"   "skulls"   "water"     "weightgain"

> edit(water)

…………



> summary(manova(cbind(hardness, mortality) ~ location,data = water), test = "Hotelling-Lawley")

      Df Hotelling-Lawley approx F num Df den Df   Pr(>F)  

location   1       0.9002 26.1062     2   58 8.217e-09 ***

Residuals 59                                    

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>

## The multivariate analogue of the univariate F ratio MS(eff)/MS(res), that is the matrix MS(res)^(-1) MS(eff), is a matrix rather than a single number. To reduce it to a test statistic you can choose between Wilks' Lambda (the likelihood-ratio test), Pillai's trace, the Hotelling-Lawley trace and Roy's greatest root. They all depend on the eigenvalues of this matrix and can be converted to approximate F tests. The default in R and S-PLUS is Pillai's trace.

## The formula above cannot be typed here exactly as it appears in the text; consult a statistics reference for the precise form (a generalized F test).

## I do not have the formulas for these statistics at hand; if anyone has material on them, please send me a message so I can study them properly.
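
## A sketch I added later (not from the handbook): all four statistics are functions of the eigenvalues lambda_i of MS(res)^(-1) MS(eff), i.e. of H %*% solve(E), where H is the hypothesis (between-groups) SSCP matrix and E the residual SSCP matrix:
##   Wilks' Lambda       = prod( 1 / (1 + lambda_i) )
##   Pillai's trace      = sum( lambda_i / (1 + lambda_i) )
##   Hotelling-Lawley    = sum( lambda_i )
##   Roy's greatest root = max( lambda_i )
## The numeric check below assumes summary.manova() exposes the SSCP matrices in its $SS component (it does in the versions I have used):
wh_manova <- manova(cbind(hardness, mortality) ~ location, data = water)
ssp <- summary(wh_manova)$SS                 # list with components "location" and "Residuals"
H <- ssp$location
E <- ssp$Residuals
lambda <- Re(eigen(H %*% solve(E))$values)   # eigenvalues (real up to rounding noise)
c(Wilks = prod(1 / (1 + lambda)),
  Pillai = sum(lambda / (1 + lambda)),
  "Hotelling-Lawley" = sum(lambda),          # compare with the 0.9002 printed above
  Roy = max(lambda))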

## The p-value, together with the Hotelling-Lawley statistic, is very small, showing that the mean vectors of the two regions clearly differ. The two calculations below show how large the regional difference in water hardness and mortality is: the South has harder water and lower mortality, the North softer water and higher mortality.



> tapply(water$hardness, water$location, mean)

  North   South

30.40000 69.76923

>



> tapply(water$mortality, water$location, mean)

  North   South

1633.600 1376.808

>
#####################################################

#

# CHAPTER 4

# 4.3 Analysis Using R

# 4.3.4 Male Egyptian Skulls

#

#####################################################





> ls()

[1] "foster"   "skulls"   "water"     "weightgain"

>

> edit(skulls)

    epoch mb bh bl nh

1   c4000BC 131 138 89 49

2   c4000BC 125 131 92 48

3   c4000BC 131 132 99 50

…………………………………………

150 cAD150 136 133 97 51

>



> means <- aggregate(skulls[, c("mb", "bh", "bl","nh")], list(epoch = skulls$epoch), mean)

> means

  epoch     mb     bh     bl     nh

1 c4000BC 131.3667 133.6000 99.16667 50.53333

2 c3300BC 132.3667 132.7000 99.06667 50.23333

3 c1850BC 134.4667 133.8000 96.03333 50.56667

4 c200BC 135.5000 132.3000 94.53333 51.96667

5 cAD150 136.1667 130.3333 93.50000 51.36667

>



> pairs(means[, -1], panel = function(x, y) {text(x, y, abbreviate(levels(skulls$epoch)))})

> ## This gives a scatterplot matrix, with mb, bh, bl and nh each taken as the horizontal axis and plotted against the other three measurements. The four measurements clearly differ across the epochs. Next we carry out the multivariate analysis.



> skulls_manova <- manova(cbind(mb, bh, bl, nh) ~epoch, data = skulls)

> skulls_manova

Call:

  manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls)



Terms:

            epoch Residuals

mb           502.827 3061.067

bh           229.907 3405.267

bl           803.293 3505.967

nh           61.200 1472.133

Deg. of Freedom     4     145



Residual standard error: 4.59465 4.846091 4.917223 3.186321

Estimated effects are balanced

>





> summary(skulls_manova, test = "Pillai")

      Df Pillai approx F num Df den Df   Pr(>F)  

epoch     4 0.3533   3.5120   16   580 4.675e-06 ***

Residuals 145                            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>



> summary(skulls_manova, test = "Wilks")

        Df Wilks approx F num Df den Df   Pr(>F)  

epoch     4.00 0.6636   3.9009 16.00 434.45 7.01e-07 ***

Residuals 145.00                            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>





> summary(skulls_manova, test = "Hotelling-Lawley")

      Df Hotelling-Lawley approx F num Df den Df   Pr(>F)  

epoch     4       0.4818   4.2310   16   562 8.278e-08 ***

Residuals 145                                    

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>





> summary(skulls_manova, test = "Roy")

      Df   Roy approx F num Df den Df   Pr(>F)  

epoch     4 0.4251 15.4097     4   145 1.588e-10 ***

Residuals 145                              

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>

## Across the five epochs, the four measurements jointly differ markedly (highly significant). Looking at each measurement separately below, only nh shows no difference between epochs.



> summary.aov(manova(cbind(mb, bh, bl, nh) ~ epoch,data = skulls))

Response mb :

        Df Sum Sq Mean Sq F value   Pr(>F)  

epoch       4 502.83 125.71 5.9546 0.0001826 ***

Residuals   145 3061.07   21.11              

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Response bh :

        Df Sum Sq Mean Sq F value Pr(>F)

epoch       4 229.9   57.5 2.4474 0.04897 *

Residuals   145 3405.3   23.5            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Response bl :

        Df Sum Sq Mean Sq F value   Pr(>F)  

epoch       4 803.3   200.8 8.3057 4.636e-06 ***

Residuals   145 3506.0   24.2              

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Response nh :

        Df Sum Sq Mean Sq F value Pr(>F)

epoch       4   61.20   15.30   1.507 0.2032

Residuals   145 1472.13   10.15          



>



## Given the differences found above, we now test each later epoch against c4000BC with pairwise comparisons.



> summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "c3300BC")))

      Df Pillai approx F num Df den Df Pr(>F)

epoch     1 0.02767 0.39135     4   55 0.814

Residuals 58                        

>





> summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "c1850BC")))

      Df Pillai approx F num Df den Df Pr(>F)

epoch     1 0.1876   3.1744     4   55 0.02035 *

Residuals 58                          

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>





> summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "c200BC")))

      Df Pillai approx F num Df den Df   Pr(>F)  

epoch     1 0.3030   5.9766     4   55 0.0004564 ***

Residuals 58                            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>



> summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "cAD150")))

      Df Pillai approx F num Df den Df   Pr(>F)  

epoch     1 0.3618   7.7956     4   55 4.736e-05 ***

Residuals 58                            

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

>



## The four pairwise comparisons show that the further an epoch is from c4000BC, the larger the difference in the four measurements.
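
## A quick visual check I added (not in the handbook) of that trend, using the table of epoch means computed earlier:
matplot(as.matrix(means[, -1]), type = "b", xaxt = "n",
  xlab = "epoch", ylab = "mean measurement")
axis(1, at = 1:5, labels = as.character(means$epoch))
legend("topright", legend = colnames(means)[-1], col = 1:4, lty = 1:4, bty = "n")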

## From here on the commands themselves are easy to follow; the difficulty, which keeps increasing, lies in interpreting the output by combining statistical theory with subject knowledge.





#####################################################

#

# CHAPTER 5

# Multiple Linear Regression: Cloud Seeding

#

#####################################################



5.1 Introduction # no R code in the handbook for this section

5.2 Multiple Linear Regression # no R code in the handbook for this section

5.3 Analysis Using R




####################################################

> source("D:\\Rstudy\\Tiro.R")

> ls()

[1] "aspirin"       "BCG"         "birthdeathrates" "bladdercancer"

[5] "BtheB"       "clouds"       "CYGOB1"       "epilepsy"    

[9] "foster"       "gardenflowers"   "heptathlon"     "Lanza"      

[13] "mastectomy"     "meteo"       "orallesions"   "phosphate"    

[17] "pistonrings"   "planets"       "plasma"       "polyps"      

[21] "pottery"       "rearrests"     "respiratory"   "roomwidth"    

[25] "schizophrenia"   "schizophrenia2" "schooldays"     "skulls"      

[29] "smoking"       "students"     "suicides"     "toothpaste"  

[33] "voting"       "water"       "watervoles"     "waves"      

[37] "weightgain"     "womensrole"  

>




> edit(clouds)

  seeding time cloudcover sne prewetness echomotion rainfall

1     no   0     1.75 13.4     0.274 stationary   12.85

2     yes   1     2.70 37.9     1.267   moving   5.52

3     yes   3     4.10 3.9     0.198 stationary   6.29

4     no   4     2.35 5.3     0.526   moving   6.11

5     yes   6     4.25 7.1     0.250   moving   2.45

6     no   9     1.60 6.9     0.018 stationary   3.61

7     no   18     1.30 4.6     0.307   moving   0.47

8     no   25     3.35 4.9     0.194   moving   4.56

9     no   27     2.85 12.1     0.751   moving   6.35

10   yes   28     2.20 5.2     0.084   moving   5.06

11   yes   29     4.40 4.1     0.236   moving   2.76

12   yes   32     3.10 2.8     0.214   moving   4.05

13     no   33     3.95 6.8     0.796   moving   5.74

14   yes   35     2.90 3.0     0.124   moving   4.84

15   yes   38     2.05 7.0     0.144   moving   11.86

16     no   39     4.00 11.3     0.398   moving   4.45

17     no   53     3.35 4.2     0.237 stationary   3.66

18   yes   55     3.70 3.3     0.960   moving   4.22

19     no   56     3.80 2.2     0.230   moving   1.16

20   yes   59     3.40 6.5     0.142 stationary   5.45

21   yes   65     3.15 3.1     0.073   moving   2.02

22     no   68     3.15 2.6     0.136   moving   0.82

23   yes   82     4.01 8.3     0.123   moving   1.09

24     no   83     4.65 7.4     0.168   moving   0.28

>

####################################################



layout(matrix(1:2, nrow = 2))

bxpseeding <- boxplot(rainfall ~ seeding, data = clouds,ylab = "Rainfall", xlab = "Seeding")

bxpecho <- boxplot(rainfall ~ echomotion, data = clouds,ylab = "Rainfall", xlab = "Echo Motion")

rownames(clouds)[clouds$rainfall %in% c(bxpseeding$out,bxpecho$out)]

[1] "1" "15"

>

## Observations 1 and 15 of rainfall are outliers; if we do not deal with them we will run into trouble when building the model.
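
## A sketch I added (not the handbook's approach, which keeps all 24 observations): refitting without observations 1 and 15 is one quick way to gauge their influence.
clouds_lm_sub <- lm(rainfall ~ seeding * (sne + cloudcover + prewetness + echomotion) + time,
  data = clouds, subset = -c(1, 15))
summary(clouds_lm_sub)   # compare with the full fit reported below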



#####################################################

#

# CHAPTER 5

# Multiple Linear Regression: Cloud Seeding

# 5.3.1 Fitting a Linear Model

#

#####################################################



> clouds_formula <- rainfall ~ seeding * (sne + cloudcover +prewetness + echomotion) + time

> Xstar <- model.matrix(clouds_formula, data = clouds)

> attr(Xstar, "contrasts")

$seeding

[1] "contr.treatment"



$echomotion

[1] "contr.treatment"



>

layout(matrix(1:4, nrow = 2))

plot(rainfall ~ time, data = clouds)

plot(rainfall ~ sne, data = clouds, xlab = "S-NE criterion")

plot(rainfall ~ cloudcover, data = clouds)

plot(rainfall ~ prewetness, data = clouds)

>

######################################################################

> clouds_lm <- lm(clouds_formula, data = clouds)

> class(clouds_lm)

[1] "lm"



> summary(clouds_lm)



Call:

lm(formula = clouds_formula, data = clouds)



Residuals:

  Min     1Q Median     3Q   Max

-2.5259 -1.1486 -0.2704 1.0401 4.3913



Coefficients:

                    Estimate Std. Error t value Pr(>|t|)  

(Intercept)               -0.34624   2.78773 -0.124 0.90306  

seedingyes               15.68293   4.44627   3.527 0.00372 **

sne                     0.38786   0.21786   1.780 0.09839 .

cloudcover               0.41981   0.84453   0.497 0.62742  

prewetness               4.10834   3.60101   1.141 0.27450  

echomotionstationary         3.15281   1.93253   1.631 0.12677  

time                   -0.04497   0.02505 -1.795 0.09590 .

seedingyes:sne             -0.48625   0.24106 -2.017 0.06482 .

seedingyes:cloudcover       -3.19719   1.26707 -2.523 0.02545 *

seedingyes:prewetness       -2.55707   4.48090 -0.571 0.57796  

seedingyes:echomotionstationary -0.56222   2.64430 -0.213 0.83492  

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



Residual standard error: 2.205 on 13 degrees of freedom

Multiple R-Squared: 0.7158,   Adjusted R-squared: 0.4972

F-statistic: 3.274 on 10 and 13 DF, p-value: 0.02431

###################################################################

> betastar <- coef(clouds_lm)

> betastar

            (Intercept)               seedingyes                   sne

            -0.34624093               15.68293481               0.38786207

              cloudcover               prewetness         echomotionstationary

              0.41981393               4.10834188               3.15281358

                  time             seedingyes:sne       seedingyes:cloudcover

            -0.04497427               -0.48625492               -3.19719006

      seedingyes:prewetness seedingyes:echomotionstationary

            -2.55706696               -0.56221845

>

##################################################################

> Vbetastar <- vcov(clouds_lm)

> Vbetastar

                      (Intercept) seedingyes       sne   cloudcover   prewetness echomotionstationary

(Intercept)               7.771461004 -7.83267429 -0.1433967355 -1.644958585 -2.55155982       -2.50174305

seedingyes               -7.832674285 19.76928191 0.1050443507 2.246894834   1.48189835       2.37629116

sne                   -0.143396735 0.10504435 0.0474608040 -0.014684744 -0.35944807       -0.16369798

cloudcover               -1.644958585 2.24689483 -0.0146847443 0.713230815 -0.31996213       0.47446664

prewetness               -2.551559821 1.48189835 -0.3594480676 -0.319962129 12.96725102       3.47784575

echomotionstationary         -2.501743048 2.37629116 -0.1636979778 0.474466642   3.47784575       3.73465643

time                   0.001210707 -0.03294446 0.0007585525 -0.011905394   0.02115630       0.00248125

seedingyes:sne             0.144605495 -0.26413213 -0.0467034712 0.002798493   0.38057035       0.16617524

seedingyes:cloudcover         1.646502793 -5.13020221 0.0156522477 -0.728415669   0.34694614       -0.47130191

seedingyes:prewetness         2.555576824 -0.23800138 0.3619648683 0.280461233 -12.89705656       -3.46961322

seedingyes:echomotionstationary 2.512665200 -1.86342376 0.1705411103 -0.581868817 -3.28698839       -3.71227233

                          time seedingyes:sne seedingyes:cloudcover seedingyes:prewetness

(Intercept)               0.0012107067   0.1446054954         1.64650279         2.55557682

seedingyes               -0.0329444650 -0.2641321294       -5.13020221       -0.23800138

sne                     0.0007585525 -0.0467034712         0.01565225         0.36196487

cloudcover               -0.0119053940   0.0027984932       -0.72841567         0.28046123

prewetness               0.0211562957   0.3805703456         0.34694614       -12.89705656

echomotionstationary         0.0024812498   0.1661752379       -0.47130191       -3.46961322

time                   0.0006276460 -0.0001319157         0.01270593       -0.01907383

seedingyes:sne             -0.0001319157   0.0581099809         0.03013348       -0.59864727

seedingyes:cloudcover         0.0127059314   0.0301334804         1.60547157       -0.94879172

seedingyes:prewetness       -0.0190738295 -0.5986472667       -0.94879172       20.07842755

seedingyes:echomotionstationary 0.0031809356 -0.1882948928         0.13638580         4.25948123

                    seedingyes:echomotionstationary

(Intercept)                           2.512665200

seedingyes                           -1.863423757

sne                                 0.170541110

cloudcover                           -0.581868817

prewetness                           -3.286988387

echomotionstationary                     -3.712272327

time                                 0.003180936

seedingyes:sne                         -0.188294893

seedingyes:cloudcover                     0.136385796

seedingyes:prewetness                     4.259481228

seedingyes:echomotionstationary               6.992321142

>

###############################################################



> sqrt(diag(Vbetastar))

            (Intercept)               seedingyes                   sne

              2.78773403               4.44626606               0.21785501

              cloudcover               prewetness         echomotionstationary

              0.84452994               3.60100694               1.93252592

                  time             seedingyes:sne       seedingyes:cloudcover

              0.02505286               0.24106012               1.26707204

      seedingyes:prewetness seedingyes:echomotionstationary

              4.48089584               2.64429975

>

###############################################################



#####################################################

#

# CHAPTER 5

# Multiple Linear Regression: Cloud Seeding

# 5.3.2 Regression Diagnostics

#

#####################################################



> clouds_resid <- residuals(clouds_lm)

> clouds_resid

      1         2         3         4         5         6         7         8         9

2.985715516 -0.509180552 -0.038544387 1.432918478 -0.502229215 -2.213704937 -1.965406758 1.926678615 -1.064419239

      10       11       12       13       14       15       16       17       18

-2.525851049 0.985324791 -1.294124009 0.004424687 -0.765383105 4.391315884 -1.146979637 -0.772010579 0.468655269

      19       20       21       22       23       24

0.631292095 0.038544387 -1.452856585 1.334901726 1.204328571 -1.153409966

>

#####################################################

> clouds_fitted <- fitted(clouds_lm)

> clouds_fitted

      1       2       3       4       5       6       7       8       9       10       11

9.8642845 6.0291806 6.3285444 4.6770815 2.9522292 5.8237049 2.4354068 2.6333214 7.4144192 7.5858510 1.7746752

    12       13       14       15       16       17       18       19       20       21       22

5.3441240 5.7355753 5.6053831 7.4686841 5.5969796 4.4320106 3.7513447 0.5287079 5.4114556 3.4728566 -0.5149017

    23       24

-0.1143286 1.4334100

>

#####################################################



psymb <- as.numeric(clouds$seeding)

plot(rainfall ~ cloudcover, data = clouds, pch = psymb)

abline(lm(rainfall ~ cloudcover, data = clouds,subset = seeding == "no"))

abline(lm(rainfall ~ cloudcover, data = clouds,subset = seeding == "yes"), lty = 2)

legend("topright", legend = c("No seeding", "Seeding"),pch = 1:2, lty = 1:2, bty = "n")



#####################################################



plot(clouds_fitted, clouds_resid, xlab = "Fitted values",ylab = "Residuals", ylim = max(abs(clouds_resid)) *c(-1, 1), type = "n")

abline(h = 0, lty = 2)

text(clouds_fitted, clouds_resid, labels = rownames(clouds))



#####################################################



qqnorm(clouds_resid, ylab = "Residuals")

qqline(clouds_resid)



####################################################

The operations in this chapter of the handbook are not hard; the difficulty is in the statistical theory, so study it together with the underlying principles.
#####################################################

#

# CHAPTER 6

# Logistic Regression and Generalised Linear Models: Blood Screening, Women’s Role in Society, and Colonic Polyps

# I will first collect the individual commands; anything unclear we can then discuss together, since at this stage there is no need to explain every line

#####################################################

6.1 Introduction # no R code in the handbook for this section

6.2 Logistic Regression and Generalised Linear Models # no R code in the handbook for this section

6.3 Analysis Using R



#####################################################

#

# CHAPTER 6

# 6.3.1 ESR and Plasma Proteins

#

#####################################################



> source("D:\\Rstudy\\Tiro.R")

> ls()

[1] "aspirin"       "BCG"         "birthdeathrates" "bladdercancer"

[5] "BtheB"       "clouds"       "CYGOB1"       "epilepsy"    

[9] "foster"       "gardenflowers"   "heptathlon"     "Lanza"      

[13] "mastectomy"     "meteo"       "orallesions"   "phosphate"    

[17] "pistonrings"   "planets"       "plasma"       "plasma_glm_1"  

[21] "polyps"       "pottery"       "rearrests"     "respiratory"  

[25] "roomwidth"     "schizophrenia"   "schizophrenia2" "schooldays"  

[29] "skulls"       "smoking"       "students"     "suicides"    

[33] "toothpaste"     "voting"       "water"       "watervoles"  

[37] "waves"       "weightgain"     "womensrole"  

>



###################################################



edit(plasma)

plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma,family = binomial())

plasma_glm_1

confint(plasma_glm_1, parm = "fibrinogen")

exp(coef(plasma_glm_1)["fibrinogen"])

exp(confint(plasma_glm_1, parm = "fibrinogen"))



###################################################



layout(matrix(1:2, ncol = 2))

cdplot(ESR ~ fibrinogen, data = plasma)

cdplot(ESR ~ globulin, data = plasma)



##################################################



plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin,data = plasma, family = binomial())

plasma_glm_2

summary(plasma_glm_2)

anova(plasma_glm_1, plasma_glm_2, test = "Chisq")

summary(plasma_glm_1)



#################################################



prob <- predict(plasma_glm_1, type = "response")

prob

plot(globulin ~ fibrinogen, data = plasma, xlim = c(2,6), ylim = c(25, 50), pch = ".")

symbols(plasma$fibrinogen, plasma$globulin, circles = prob,add = TRUE)



################################################



#####################################################

#

# CHAPTER 6

# 6.3.2 Women’s Role in Society

#

#####################################################



############################################

edit(womensrole)

womensrole_glm_1 <- glm(cbind(agree, disagree) ~sex + education, data = womensrole, family = binomial())

womensrole_glm_1

role.fitted1 <- predict(womensrole_glm_1, type = "response")

role.fitted1



###########################################



myplot <- function(role.fitted)

   {

   f <- womensrole$sex == "Female"

   plot(womensrole$education, role.fitted, type = "n",

   ylab = "Probability of agreeing", xlab = "Education", ylim = c(0, 1))

   lines(womensrole$education[!f], role.fitted[!f],lty = 1)

   lines(womensrole$education[f], role.fitted[f],lty = 2)

   lgtxt <- c("Fitted (Males)", "Fitted (Females)")

   legend("topright", lgtxt, lty = 1:2, bty = "n")

   y <- womensrole$agree/(womensrole$agree + womensrole$disagree)

   text(womensrole$education, y, ifelse(f, "\\VE","\\MA"), family = "HersheySerif", cex = 1.25)

   }



myplot(role.fitted1)



#############################################



womensrole_glm_2 <- glm(cbind(agree, disagree) ~sex * education, data = womensrole, family = binomial())

womensrole_glm_2

summary(womensrole_glm_2)



#############################################



res <- residuals(womensrole_glm_2, type = "deviance")

plot(predict(womensrole_glm_2), res, xlab = "Fitted values",

   ylab = "Residuals", ylim = max(abs(res)) * c(-1,1))

abline(h = 0, lty = 2)



#############################################



#####################################################

#

# CHAPTER 6

# 6.3.3 Colonic Polyps

#

#####################################################



> edit(polyps)



######################################################



polyps_glm_1 <- glm(number ~ treat + age, data = polyps,family = poisson())

summary(polyps_glm_1)

polyps_glm_2 <- glm(number ~ treat + age, data = polyps,family = quasipoisson())

summary(polyps_glm_2)



######################################################
#####################################################

#

# CHAPTER 7

# Density Estimation: Erupting Geysers

#

#####################################################

7.1 Introduction # no R code in the handbook for this section

7.2 Density Estimation # no R code in the handbook for this section

7.3 Analysis Using R



#####################################################

#

# CHAPTER 7

# 7.3.1 A Parametric Density Estimate for the Old Faithful Data

#

#####################################################



> source("D:\\Rstudy\\Tiro.R")

> ls()

[1] "aspirin"       "BCG"         "birthdeathrates" "bladdercancer"

[5] "BtheB"       "clouds"       "CYGOB1"       "epilepsy"    

[9] "foster"       "gardenflowers"   "heptathlon"     "Lanza"      

[13] "mastectomy"     "meteo"       "orallesions"   "phosphate"    

[17] "pistonrings"   "planets"       "plasma"       "plasma_glm_1"  

[21] "polyps"       "pottery"       "rearrests"     "respiratory"  

[25] "roomwidth"     "schizophrenia"   "schizophrenia2" "schooldays"  

[29] "skulls"       "smoking"       "students"     "suicides"    

[33] "toothpaste"     "voting"       "water"       "watervoles"  

[37] "waves"       "weightgain"     "womensrole"  

>

#####################################################



logL <- function(param, x)

   {

   d1 <- dnorm(x, mean = param[2], sd = param[3])

   d2 <- dnorm(x, mean = param[4], sd = param[5])

   -sum(log(param[1] * d1 + (1 - param[1]) * d2))

   }



startparam <- c(p = 0.5, mu1 = 50, sd1 = 3, mu2 = 80,sd2 = 3)

opp <- optim(startparam, logL, x = faithful$waiting,

   method = "L-BFGS-B", lower = c(0.01, rep(1,4)),

   upper = c(0.99, rep(200, 4)))

opp



####################################################



data("faithful", package = "datasets")

x <- faithful$waiting

layout(matrix(1:3, ncol = 3))

hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",

   probability = TRUE, main = "Gaussian kernel",border = "gray")

lines(density(x, width = 12), lwd = 2)

rug(x)

hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",

   probability = TRUE, main = "Rectangular kernel",border = "gray")

lines(density(x, width = 12, window = "rectangular"),lwd = 2)

rug(x)

hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",

   probability = TRUE, main = "Triangular kernel",border = "gray")

lines(density(x, width = 12, window = "triangular"),lwd = 2)

rug(x)



####################################################



library("KernSmooth")

CYGOB1d <- bkde2D(CYGOB1, bandwidth = sapply(CYGOB1,dpik))

contour(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,

   xlab = "log surface temperature", ylab = "log light intensity")



###################################################



persp(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,

   xlab = "log surface temperature", ylab = "log light intensity",

   zlab = "estimated density", theta = -35, axes = TRUE,

   box = TRUE)



##################################################



library("mclust")

mc <- Mclust(faithful$waiting)

mc

mc$parameters$mean

sqrt(mc$parameters$variance$sigmasq)



##################################################



library("flexmix")

fl <- flexmix(waiting ~ 1, data = faithful, k = 2)

parameters(fl, component = 1)

parameters(fl, component = 2)



#################################################



library("boot")

fit <- function(x, indx)

   {

   a <- Mclust(x[indx], minG = 2, maxG = 2)$parameters

   if (a$pro[1] < 0.5)

       return(c(p = a$pro[1], mu1 = a$mean[1],mu2 = a$mean[2]))

   return(c(p = 1 - a$pro[1], mu1 = a$mean[2],mu2 = a$mean[1]))

   }



#########################



opar <- as.list(opp$par)

rx <- seq(from = 40, to = 110, by = 0.1)

d1 <- dnorm(rx, mean = opar$mu1, sd = opar$sd1)

d2 <- dnorm(rx, mean = opar$mu2, sd = opar$sd2)

f <- opar$p * d1 + (1 - opar$p) * d2

hist(x, probability = TRUE, xlab = "Waiting times (in min.)",

   border = "gray", xlim = range(rx), ylim = c(0,0.06), main = "")

lines(rx, f, lwd = 2)

lines(rx, dnorm(rx, mean = mean(x), sd = sd(x)),lty = 2, lwd = 2)

legend(50, 0.06, legend = c("Fitted two-component mixture density",

   "Fitted single normal density"), lty = 1:2,bty = "n")



###################################################



bootpara <- boot(faithful$waiting, fit, R = 100) # R was originally 1000 here; that takes too long to run, so I use 100 while learning



bootpara

boot.ci(bootpara, type = "bca", index = 1) # errors in my run: the estimated BCa adjustment 'a' is NA

boot.ci(bootpara, type = "bca", index = 2) # errors as well

boot.ci(bootpara, type = "bca", index = 3) # errors as well
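
## My workaround (not from the handbook): with only R = 100 replicates the BCa adjustment may not be estimable, so either rerun boot() with the original R = 1000 and retry type = "bca", or fall back on percentile intervals:
boot.ci(bootpara, type = "perc", index = 1)
boot.ci(bootpara, type = "perc", index = 2)
boot.ci(bootpara, type = "perc", index = 3)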



####################################################



bootplot <- function(b, index, main = "")

   {

   dens <- density(b$t[, index])

   ci <- boot.ci(b, type = "bca", index = index)$bca[4:5]

   est <- b$t0[index]

   plot(dens, main = main)

   y <- max(dens$y)/10

   segments(ci[1], y, ci[2], y, lty = 2)

   points(ci[1], y, pch = "(")

   points(ci[2], y, pch = ")")

   points(est, y, pch = 19)

   }



layout(matrix(1:2, ncol = 2))

bootplot(bootpara, 2, main = expression(mu[1])) # errors: the BCa adjustment 'a' is NA, as above

bootplot(bootpara, 3, main = expression(mu[2])) # errors: the BCa adjustment 'a' is NA, as above

#################################################
Everything later in the handbook runs without problems; what remains is mostly statistical theory. I have now collected the commands and expressions for the whole handbook, with the R prompts stripped so they can be copied and run directly. The next step is to understand as you run them; when you do not see why the handbook analyses something a particular way, go back to the handbook for the explanation. That way you learn the R operations and the statistical analysis side by side:

#########################################################

###############################################

#

# A Handbook of Statistical Analyses Using R

# Brian S. Everitt and Torsten Hothorn

#

###############################################



###############################################

#

# Preface

#

###############################################



search()

[1] ".GlobalEnv"     "package:methods"   "package:stats"  

[4] "package:graphics" "package:grDevices" "package:utils"  

[7] "package:datasets" "Autoloads"       "package:base"  



library("HSAUR")

Loading required package: lattice

Loading required package: MASS

Loading required package: scatterplot3d



search()

[1] ".GlobalEnv"         "package:HSAUR"       "package:scatterplot3d"

[4] "package:MASS"       "package:lattice"     "package:methods"    

[7] "package:stats"       "package:graphics"     "package:grDevices"  

[10] "package:utils"       "package:datasets"     "Autoloads"        

[13] "package:base"      



# As long as the packages you need are installed beforehand, R loads the required packages automatically. I have installed all the CRAN packages,

# except the bioinformatics ones.

# When loading a package, note that R is case sensitive: upper and lower case of the same letter are different characters. Inside the double quotes

# a space is also a character, so if there is a space inside the quotes that is not in the package name, R will report that no such package exists.



vignette("Ch_introduction_to_R", package = "HSAUR")

# This displays the vignette Ch_introduction_to_R from the installed package (a PDF in this case)



edit(vignette("Ch_introduction_to_R", package = "HSAUR")) # view the R code underlying the vignette



# To get an overview of all the vignettes in the package, enter the following command:

vignette(package = "HSAUR")



x <- sqrt(25) + 2

x

print(x)



#################################################

#

# CHAPTER 1

# An Introduction to R

# 1.3 Help and Documentation

#

################################################



help("mean") # or use ?mean

library("sandwich")

help(package = "sandwich") # overview of the package



vignette("sandwich") # open the package's usage notes (vignette)



#################################################

#

# CHAPTER 1

# An Introduction to R

# 1.4 Data Objects in R

#

#################################################



data("Forbes2000", package = "HSAUR") # the data set contains characters my system cannot display properly, so replace them first

Forbes2000[32,2]<-"FMISN1"

Forbes2000[54,2]<-"FMISN2"

Forbes2000[883,2]<-"FMISN3"

edit(Forbes2000)



str(Forbes2000)

help("Forbes2000")



class(Forbes2000)

dim(Forbes2000)

nrow(Forbes2000)

ncol(Forbes2000)

names(Forbes2000)

class(Forbes2000[, "rank"])

length(Forbes2000[, "rank"])

class(Forbes2000[, "name"])

length(Forbes2000[, "name"])

Forbes2000[, "name"][1]

class(Forbes2000[, "category"])

nlevels(Forbes2000[, "category"])

levels(Forbes2000[, "category"])

             

table(Forbes2000[, "category"])



class(Forbes2000[, "sales"])



median(Forbes2000[, "sales"])

mean(Forbes2000[, "sales"])

range(Forbes2000[, "sales"])

summary(Forbes2000[, "sales"])



###################################################

#

# CHAPTER 1

# An Introduction to R

# 1.5 Data Import and Export

#

###################################################



csvForbes2000 <- read.table("Forbes2000.csv", header = TRUE,sep = ",", row.names = 1)

# note row.names = 1



class(csvForbes2000[, "name"])

csvForbes2000 <- read.table("Forbes2000.csv", header = TRUE,sep = ",",

  row.names = 1, colClasses = c("character","integer", "character",

  "factor", "factor","numeric", "numeric", "numeric", "numeric"))



class(csvForbes2000[, "name"])



all.equal(csvForbes2000, Forbes2000)



classes <- c("character", "integer", "character","factor",

  "factor", "numeric", "numeric", "numeric","numeric")



length(classes)

class(classes)



write.table(Forbes2000, file = "Forbes2000.csv",sep = ",", col.names = NA)



write.table(Forbes2000, file = "Forbes.csv",sep = ",", col.names = NA)



save(Forbes2000, file = "Forbes2000.rda")



list.files(pattern = ".rda")

load("Forbes2000.rda")



###################################################

#

# CHAPTER 1

# An Introduction to R

# 1.6 Basic Data Manipulation

#

####################################################



companies <- Forbes2000[, "name"]

companies[1]



1:3

[1] 1 2 3



companies[1:3]  

companies[-(4:2000)]



length(companies)



Forbes2000[1:3, c("name", "sales", "profits", "assets")]



companies <- Forbes2000$name

companies <- Forbes2000[, "name"]



order_sales <- order(Forbes2000$sales)

companies[order_sales[1:3]]



Forbes2000[order_sales[c(2000, 1999, 1998)], c("name","sales", "profits", "assets")]



Forbes2000[Forbes2000$assets > 1000, c("name", "sales","profits", "assets")]



table(Forbes2000$assets > 1000)



na_profits <- is.na(Forbes2000$profits)

table(na_profits)





Forbes2000[na_profits, c("name", "sales", "profits","assets")]



table(complete.cases(Forbes2000))



UKcomp <- subset(Forbes2000, country == "United Kingdom")

dim(UKcomp)



CHcomp<-subset(Forbes2000, country == "China")

dim(CHcomp)

CHcomp



######################################################

#

# CHAPTER 1

# An Introduction to R

# 1.7 Simple Summary Statistics

#

######################################################



summary(Forbes2000)

lapply(Forbes2000, summary)

mprofits <- tapply(Forbes2000$profits, Forbes2000$category,median, na.rm = TRUE)

median(Forbes2000$profits)

median(Forbes2000$profits,na.rm=TRUE)



rev(sort(mprofits))[1:3]



######################################################

#

# 1.7 Simple Summary Statistics

# 1.7.1 Simple Graphics

#

######################################################



fm <- marketvalue ~ sales

class(fm)



layout(matrix(1:2, nrow = 2))

hist(Forbes2000$marketvalue)

hist(log(Forbes2000$marketvalue))



plot(log(marketvalue) ~ log(sales), data = Forbes2000,pch = ".")



plot(marketvalue~sales, data = Forbes2000,pch = ".")



#########################################################

#

# CHAPTER 1

# An Introduction to R

# 1.8 Organising an Analysis

#

#########################################################



#########################################################

#

# CHAPTER 1

# An Introduction to R

# 1.9 Summary

#

#########################################################



boxplot(log(marketvalue) ~ country,

  data = subset(Forbes2000,country %in% c("United Kingdom",

  "Germany","India", "Turkey")), ylab = "log(marketvalue)",varwidth = TRUE)



#########################################################

#

# CHAPTER 1

# An Introduction to R

# Exercises

#

#########################################################



########## Ex. 1.1:##########

tapply(Forbes2000$profits, Forbes2000$country,median, na.rm = TRUE)



tapply(Forbes2000$profits, Forbes2000$country== "United States",median, na.rm = TRUE)



tapply(Forbes2000$profits, Forbes2000$country== "United Kingdom",median, na.rm = TRUE)



tapply(Forbes2000$profits, Forbes2000$country== "France",median, na.rm = TRUE)



tapply(Forbes2000$profits, Forbes2000$country== "Germany",median, na.rm = TRUE)



######### Ex. 1.2: #########

Gercomp<-subset(Forbes2000, country == "Germany")

table(complete.cases(Gercomp))



Gercomp[Gercomp$profits < 0, c("name","country", "sales","profits", "assets")]



######################################################

Forbes2000[Forbes2000$profits < 0 & Forbes2000$country == "Germany", c("name","country", "sales","profits", "assets")]



######### Ex. 1.3: #########

levels(Forbes2000[, "category"])

myidea<-Forbes2000[Forbes2000$category %in%   # %in% rather than ==, which would recycle the vector and silently drop most matches

  c("Banking" ,"Business services & supplies" ,"Drugs & biotechnology",

  "Food drink & tobacco" ,"Hotels restaurants & leisure" , "Insurance",

  "Telecommunications services" ,"Transportation"), c("name","country",

  "category", "sales","profits", "assets")]



myidea[myidea$profits>1,c("name","country", "category", "sales","profits", "assets")]



######### Ex. 1.4: #########

ForbesTopp <- subset(Forbes2000, !is.na(profits)) # !is.na() is the proper test for missing values; profits != "NA" only works by accident

length(ForbesTopp$profits)

order_profits <- order(ForbesTopp$profits)

Forbes50<-ForbesTopp[order_profits[c(1995:1945)],   # note: 1995:1945 picks 51 rows; 1995:1946 would give exactly the top 50

  c("name", "country", "sales", "profits", "assets")]



Forbes50



layout(matrix(1:2, nrow = 2))

hist(Forbes50$assets)

hist(Forbes50$sales)

hist(log(Forbes50$assets))

hist(log(Forbes50$sales))

plot(log(sales) ~ log(assets), data = Forbes50,pch = ".")

plot(log(assets) ~ log(sales), data = Forbes50,pch = ".")



######### Ex. 1.5: #########

companies_mean<-tapply(Forbes2000$sales, Forbes2000$country,mean, na.rm = TRUE)



summary(Forbes2000[Forbes2000$profits > 5, c("name","country", "profits")])





rm(list = ls())





######################################################

#

# CHAPTER 2

# Simple Inference:

# Guessing Lengths, Wave Energy, Water Hardness, Piston Rings, and Rearrests of Juveniles

#

######################################################



######################################################

#

# CHAPTER 2

# 2.3.1 Estimating the Width of a Room

#

######################################################



rm(list = ls())



library("HSAUR")

data("roomwidth", package = "HSAUR")

edit(roomwidth)



convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)

convert

tapply(roomwidth$width * convert, roomwidth$unit, summary)



########################################################

convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)

unit<-roomwidth$unit

width<-roomwidth$width * convert

myroom<-data.frame(unit, width)



summary(myroom)

tapply(myroom$width, myroom$unit, summary)



######################################################



layout(matrix(c(1, 2, 1, 3), nrow = 2, ncol = 2,byrow = FALSE))



matrix(c(1, 2, 1, 3), nrow = 2, ncol = 2,byrow = FALSE)



boxplot(I(width * convert) ~ unit, data = roomwidth,

  ylab = "Estimated width (feet)", varwidth = TRUE,

  names = c("Estimates in feet", "Estimates in metres (converted to feet)"))



feet <- roomwidth$unit == "feet"

qqnorm(roomwidth$width[feet], ylab = "Estimated width (feet)")

qqline(roomwidth$width[feet])

qqnorm(roomwidth$width[!feet], ylab = "Estimated width (metres)")

qqline(roomwidth$width[!feet])

#####################################################



convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)

unit<-roomwidth$unit

width<-roomwidth$width * convert

t.test(width~unit, alternative='two.sided',

  conf.level=.95, var.equal=FALSE, data=roomwidth)



convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)

unit<-roomwidth$unit

width<-roomwidth$width * convert

myroom<-data.frame(unit, width)



t.test(width~unit, alternative='two.sided', conf.level=.95, var.equal=FALSE, data=myroom)



t.test(width~unit, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=myroom)



###################################################



convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)

t.test(I(width * convert) ~ unit, data = roomwidth, var.equal = TRUE)



t.test(I(width * convert) ~ unit, data = roomwidth, var.equal = FALSE)



wilcox.test(I(width * convert) ~ unit, data = roomwidth, conf.int = TRUE)





######################################################

#

# CHAPTER 2

# 2.3.2 Wave Energy Device Mooring

#

######################################################



data("waves", package = "HSAUR")

edit(waves)

t.test(waves$method1, waves$method2,

  alternative='two.sided', conf.level=.95, paired=TRUE)



#####################################################



boxplot(waves$method1, ylab="method1")



mooringdiff <- waves$method1 - waves$method2

layout(matrix(1:2, ncol = 2))

boxplot(mooringdiff, ylab = "Differences (Newton metres)",main = "Boxplot")

abline(h = 0, lty = 2)

qqnorm(mooringdiff, ylab = "Differences (Newton metres)")

qqline(mooringdiff)





######################################################

#

# CHAPTER 2

# 2.3.3 Mortality and Water Hardness

#

######################################################



library("HSAUR")

data("roomwidth", "waves", "water", "pistonrings", package = "HSAUR")



######################################################

## Hist() and scatterplot() below were pasted from an R Commander session; Hist() comes from the Rcmdr package and scatterplot() from the car package, so load those first (or use plain hist() and plot() instead)
Hist(water$hardness, scale="frequency", breaks="Sturges", col="darkgray")



scatterplot(mortality~hardness, reg.line=lm, smooth=TRUE,

  labels=FALSE, boxplots='xy', span =0.5, data=water)



boxplot(water$mortality, ylab="mortality")



#####################################################



nf <- layout(matrix(c(2, 0, 1, 3), 2, 2, byrow = TRUE),c(2, 1), c(1, 2), TRUE)

psymb <- as.numeric(water$location)



plot(mortality ~ hardness, data = water, pch = psymb)



abline(lm(mortality ~ hardness, data = water))



legend("topright", legend = levels(water$location),pch = c(1, 2), bty = "n")



hist(water$hardness)



boxplot(water$mortality)





######################################################

#

# CHAPTER 2

# 2.3.4 Piston-ring Failures

#

######################################################



data("pistonrings", package = "HSAUR")



edit(pistonrings)



chisq.test(pistonrings)$residuals



library(vcd)

assoc(pistonrings)



######################################################

#

# CHAPTER 2

# 2.3.5 Rearrests of Juveniles

#

######################################################



data("rearrests", package = "HSAUR")

edit(rearrests)

mcnemar.test(rearrests, correct = FALSE)

binom.test(rearrests[2], n = sum(rearrests[c(2,3)]))



#####################################################

#

# CHAPTER 3

# Conditional Inference:

# Guessing Lengths, Suicides, Gastrointestinal Damage, and Newborn Infants

#

######################################################



#####################################################

#

# CHAPTER 3

# 3.3 Analysis Using R

# 3.3.1 Estimating the Width of a Room Revised

#

######################################################



rm(list = ls())



library("HSAUR")

data("roomwidth", package = "HSAUR")



convert <- ifelse(roomwidth$unit == "feet", 1, 3.28)

feet <- roomwidth$unit == "feet"

metre <- !feet

y <- roomwidth$width * convert

T <- mean(y[feet]) - mean(y[metre])

T



meandiffs <- double(9999)

for (i in 1:length(meandiffs))

  {

  sy <- sample(y)

  meandiffs[ i ] <- mean(sy[feet]) - mean(sy[metre])

  }



# note: in the original there are no spaces around i in [i]; they are added here only because [i] clashes with the forum's italics markup

hist(meandiffs)



abline(v = T, lty = 2)

abline(v = -T, lty = 2)



greater <- abs(meandiffs) > abs(T)



mean(greater)

binom.test(sum(greater), length(greater))$conf.int

# the printed interval also carries attr(,"conf.level") = 0.95



binom.test(sum(greater), length(greater))



library("coin")

independence_test(y ~ unit, data = roomwidth, distribution = "exact")



wilcox_test(y ~ unit, data = roomwidth, distribution = "exact")



#####################################################

#

# CHAPTER 3

# 3.3.3 Gastrointestinal Damages

#

#####################################################



data("Lanza", package = "HSAUR")

edit(Lanza)

xtabs(~treatment + classification + study, data = Lanza)



## the general usage of xtabs() (from its help page), shown as a comment so the script still runs:

## xtabs(formula = ~., data = parent.frame(), subset, na.action,

##   exclude = c(NA, NaN), drop.unused.levels = FALSE)



library("coin")

cmh_test(classification ~ treatment, data = Lanza,

  scores = list(classification = c(0, 1, 6, 17,30)),

  subset = Lanza$study == "I")



cmh_test(classification ~ treatment, data = Lanza,

  scores = list(classification = c(0, 1, 6, 17,30)),

  subset = Lanza$study == "II")



p <- cmh_test(classification ~ treatment, data = Lanza,

  scores = list(classification = c(0, 1, 6, 17,30)),

  subset = Lanza$study == "II",distribution = approximate(B = 19999))

p



pvalue(p)



cmh_test(classification ~ treatment, data = Lanza,

  scores = list(classification = c(0, 1, 6, 17,30)),

  subset = Lanza$study == "III")



cmh_test(classification ~ treatment, data = Lanza,

  scores = list(classification = c(0, 1, 6, 17,30)),

  subset = Lanza$study == "IV")



cmh_test(classification ~ treatment | study, data = Lanza,

  scores = list(classification = c(0, 1, 6, 17,30)))



#####################################################

#

# CHAPTER 3

# 3.3.4 Teratogenesis

#

#####################################################



anomalies <- as.table(   # keep the opening bracket on this line so the call is not split into two separate expressions

  matrix(c(235, 23, 3, 0, 41,35, 8, 0, 20, 11, 11, 1, 2, 1, 3, 1),

  ncol = 4,dimnames = list(MD = 0:3, RA = 0:3)))



anomalies



mh_test(anomalies)



mh_test(anomalies, scores = list(c(0, 1, 2, 3)))



###############################################







#####################################################

#

# CHAPTER 4

# Analysis of Variance:

# Weight Gain, Foster Feeding in Rats, Water Hardness and Male Egyptian Skulls

#

#####################################################



#####################################################

#

# CHAPTER 4

# 4.3.1 Weight Gain in Rats

#

#####################################################



rm(list = ls())



library("HSAUR")

data("weightgain", "foster", "water", "skulls", package = "HSAUR")



edit(weightgain)

tapply(weightgain$weightgain, list(weightgain$source,weightgain$type), mean)



tapply(weightgain$weightgain, list(weightgain$source,weightgain$type), sd)



wg_aov <- aov(weightgain ~ source * type, data = weightgain)



plot.design(weightgain)



options("contrasts")



coef(aov(weightgain ~ source + type + source:type,data = weightgain, contrasts = list(source = contr.sum)))



summary(wg_aov)



interaction.plot(weightgain$type, weightgain$source,weightgain$weightgain)



summary(aov(weightgain ~ type * source, data = weightgain))



#####################################################

#

# CHAPTER 4

# 4.3.2 Foster Feeding of Rats of Different Genotype

#

#####################################################



edit(foster)



tapply(foster$weight, foster$litgen, mean)



tapply(foster$weight, foster$motgen, mean)



tapply(foster$weight, foster$litgen, sd)



tapply(foster$weight, foster$motgen, sd)



tapply(foster$weight, list(foster$litgen, foster$motgen), mean)



tapply(foster$weight, list(foster$litgen, foster$motgen), sd)



summary(aov(weight ~ litgen * motgen, data = foster))



summary(aov(weight ~ motgen * litgen, data = foster))



plot.design(foster)



foster_aov <- aov(weight ~ litgen * motgen, data = foster)

foster_aov



foster_hsd <- TukeyHSD(foster_aov, "motgen")

foster_hsd



plot(foster_hsd)



#####################################################

#

# CHAPTER 4

# 4.3.3 Water Hardness and Mortality

#

#####################################################



edit(water)



summary(manova(cbind(hardness, mortality) ~ location,data = water), test = "Hotelling-Lawley")



tapply(water$hardness, water$location, mean)



tapply(water$mortality, water$location, mean)



#####################################################

#

# CHAPTER 4

# 4.3.4 Male Egyptian Skulls

#

#####################################################



edit(skulls)



means <- aggregate(skulls[, c("mb", "bh", "bl","nh")], list(epoch = skulls$epoch), mean)

means



pairs(means[, -1], panel = function(x, y) {text(x, y, abbreviate(levels(skulls$epoch)))})



skulls_manova <- manova(cbind(mb, bh, bl, nh) ~epoch, data = skulls)

skulls_manova



summary(skulls_manova, test = "Pillai")



summary(skulls_manova, test = "Wilks")



summary(skulls_manova, test = "Hotelling-Lawley")



summary(skulls_manova, test = "Roy")



summary.aov(manova(cbind(mb, bh, bl, nh) ~ epoch,data = skulls))



summary(manova(cbind(mb, bh, bl, nh) ~ epoch,

  data = skulls,subset = epoch %in% c("c4000BC", "c3300BC")))



summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "c1850BC")))



summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "c200BC")))



summary(manova(cbind(mb, bh, bl, nh) ~ epoch, data = skulls,subset = epoch %in% c("c4000BC", "cAD150")))





#####################################################

#

# CHAPTER 5

# Multiple Linear Regression: Cloud Seeding

#

#####################################################



rm(list = ls())



library("HSAUR")

library("KernSmooth")

library("mclust")

library("boot")

library("flexmix")

data(aspirin, BtheB, Lanza, phosphate, polyps, roomwidth,

  skulls, toothpaste, waves, BCG, clouds, foster, mastectomy,

  pistonrings, pottery, schizophrenia2, smoking, voting,

  weightgain, birthdeathrates, CYGOB1, gardenflowers, meteo,

  planets, rearrests, schizophrenia, students, water, womensrole,

  bladdercancer, epilepsy, heptathlon, orallesions, plasma,

  respiratory, schooldays, suicides, watervoles, package = "HSAUR")



####################################################



edit(clouds)



layout(matrix(1:2, nrow = 2))

bxpseeding <- boxplot(rainfall ~ seeding, data = clouds,

  ylab = "Rainfall", xlab = "Seeding")

bxpecho <- boxplot(rainfall ~ echomotion, data = clouds,

  ylab = "Rainfall", xlab = "Echo Motion")



rownames(clouds)[clouds$rainfall %in% c(bxpseeding$out,bxpecho$out)]



#####################################################

#

# CHAPTER 5

# 5.3.1 Fitting a Linear Model

#

#####################################################



clouds_formula <- rainfall ~ seeding *

  (sne + cloudcover +prewetness + echomotion) + time

Xstar <- model.matrix(clouds_formula, data = clouds)



attr(Xstar, "contrasts")



layout(matrix(1:4, nrow = 2))

plot(rainfall ~ time, data = clouds)

plot(rainfall ~ sne, data = clouds, xlab = "S-NE criterion")

plot(rainfall ~ cloudcover, data = clouds)

plot(rainfall ~ prewetness, data = clouds)



clouds_lm <- lm(clouds_formula, data = clouds)

class(clouds_lm)



summary(clouds_lm)



betastar <- coef(clouds_lm)

betastar



#####################################################

Vbetastar <- vcov(clouds_lm)

Vbetastar



sqrt(diag(Vbetastar))





#####################################################

#

# CHAPTER 5

# 5.3.2 Regression Diagnostics

#

#####################################################



clouds_resid <- residuals(clouds_lm)

clouds_resid



clouds_fitted <- fitted(clouds_lm)

clouds_fitted



psymb <- as.numeric(clouds$seeding)

plot(rainfall ~ cloudcover, data = clouds, pch = psymb)



abline(lm(rainfall ~ cloudcover,

  data = clouds,subset = seeding == "no"))

abline(lm(rainfall ~ cloudcover,

  data = clouds,subset = seeding == "yes"), lty = 2)

legend("topright", legend = c("No seeding", "Seeding"),

  pch = 1:2, lty = 1:2, bty = "n")



#####################################################



plot(clouds_fitted, clouds_resid, xlab = "Fitted values",

  ylab = "Residuals", ylim = max(abs(clouds_resid)) *

  c(-1, 1), type = "n")



abline(h = 0, lty = 2)

text(clouds_fitted, clouds_resid, labels = rownames(clouds))



#####################################################



qqnorm(clouds_resid, ylab = "Residuals")

qqline(clouds_resid)



####################################################



Teacher areg, you are moving really fast. I was tied up with other things these past few days, and when I came back you had produced so much that I can hardly keep up. Working on it…
Finally finished reading it, though plenty of places still leave me dazed. The master opens the door, but the practice is up to the student; I will just have to keep working hard.