graceli
Several add-on packages implement ideas and methods developed at the borderline between computer science and statistics; this field of research is usually referred to as machine learning. The packages can be roughly structured into the following topics:
*Neural Networks : Single-hidden-layer neural networks are implemented in package nnet as part of the VR bundle (shipped with base R).
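As a minimal sketch, a single-hidden-layer network can be fitted with nnet() on the iris data; the number of hidden units (size = 2) is an arbitrary choice for illustration:

```r
library(nnet)

set.seed(1)  # nnet starts from random weights, so fix the seed
# Fit a single-hidden-layer network with 2 hidden units
fit <- nnet(Species ~ ., data = iris, size = 2, trace = FALSE)
# Predicted classes for the training data
pred <- predict(fit, iris, type = "class")
table(pred, iris$Species)
```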
*Recursive Partitioning : Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book, are implemented in rpart (shipped with base R) and tree. Package rpart is recommended for computing CART-like trees. A rich toolbox of partitioning algorithms is available in Weka; package RWeka provides an interface to this implementation, including the J4.8 variant of C4.5 and M5.
Two recursive partitioning algorithms with unbiased variable selection and a statistical stopping criterion are implemented in package party. Function ctree() is based on non-parametric conditional inference procedures for testing independence between the response and each input variable, whereas mob() can be used to partition parametric models. Extensible tools for visualizing binary trees and node distributions of the response are available in package party as well.
An adaptation of rpart for multivariate responses is available in package mvpart. The validity of trees can be investigated via permutation approaches with package rpart.permutation, and a tree algorithm fitting nearest neighbors in each node is implemented in package knnTree. For problems with binary input variables the package LogicReg implements logic regression. Graphical tools for the visualization of trees are available in packages maptree and pinktoe.
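A CART-like classification tree can be grown with rpart on the kyphosis data that ships with the package (a minimal sketch):

```r
library(rpart)

# Grow a classification tree on the kyphosis data shipped with rpart
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")
printcp(fit)  # complexity-parameter table, useful for pruning decisions
pred <- predict(fit, type = "class")
```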
*Random Forests : The reference implementation of the random forest algorithm for regression and classification is available in package randomForest. Package ipred provides bagging for regression, classification and survival analysis as well as bundling, a combination of multiple models via ensemble learning. In addition, a random forest variant based on conditional inference trees is implemented in package party. The varSelRF package focuses on variable selection by means of random forest algorithms.
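A sketch of a randomForest fit follows; since randomForest is a CRAN package and not part of base R, the call is guarded with requireNamespace(), and ntree = 100 is an arbitrary illustration value:

```r
# Hedged sketch: randomForest is a CRAN package, not part of base R,
# so the call is guarded with requireNamespace().
rf <- if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  randomForest::randomForest(Species ~ ., data = iris, ntree = 100)
} else NULL
if (!is.null(rf)) print(rf$confusion)  # out-of-bag confusion matrix
```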
*Regularized and Shrinkage Methods : Regression models with some constraint on the parameter estimates can be fitted with the lasso2 and lars packages. The solutions for all values of the shrinkage parameter can be simultaneously computed using the functionality in package elasticnet. The L1 regularization path for generalized linear models and Cox models can be obtained from functions available in package glmpath. The shrunken centroids classifier and utilities for gene expression analyses are implemented in package pamr.
*Boosting : Various forms of gradient boosting are implemented in packages gbm (tree-based functional gradient descent boosting) and boost (including LogitBoost and L2Boost). Package GAMBoost can be used to fit generalized additive models by a boosting algorithm. An extensible boosting framework for generalized linear, additive and nonparametric models is available in package mboost.
*Support Vector Machines : The function svm() from e1071 offers an interface to the LIBSVM library and package kernlab implements a flexible framework for kernel learning (including SVMs, RVMs and other kernel learning algorithms). An interface to the SVMlight implementation (only for one-against-all classification) is provided in package klaR.
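A minimal sketch of svm() from e1071 on the iris data (the package is on CRAN and not part of base R, so the call is guarded; the radial kernel is e1071's default and is spelled out only for clarity):

```r
# Hedged sketch: e1071 is a CRAN package; the call is guarded accordingly.
m <- if (requireNamespace("e1071", quietly = TRUE)) {
  e1071::svm(Species ~ ., data = iris, kernel = "radial")
} else NULL
if (!is.null(m)) print(table(predict(m, iris), iris$Species))
```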
*Bayesian Methods : Bayesian Additive Regression Trees (BART), where the final model is defined in terms of the sum over many weak learners (not unlike ensemble methods), are implemented in package BayesTree.
*Optimization using Genetic Algorithms : Packages gafit and rgenoud offer optimization routines based on genetic algorithms.
*Association Rules : Package arules provides both data structures for efficient handling of sparse binary data as well as interfaces to implementations of Apriori and Eclat for mining frequent itemsets, maximal frequent itemsets, closed frequent itemsets and association rules.
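A small sketch of Apriori mining with arules, using the Adult transaction data set shipped with the package; the support and confidence thresholds below are illustration values only:

```r
# Hedged sketch: arules is a CRAN package; Adult is a sparse
# transaction data set shipped with it.
rules <- if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  data(Adult)
  apriori(Adult, parameter = list(supp = 0.5, conf = 0.9))
} else NULL
if (!is.null(rules)) inspect(rules[1:3])  # show the first few rules
```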
*Model selection and validation : Package e1071 has function tune() for hyperparameter tuning and function errorest() (ipred) can be used for error rate estimation. The cost parameter C for support vector machines can be chosen utilizing the functionality of package svmpath. Functions for ROC analysis and other visualisation techniques for comparing candidate classifiers are available from package ROCR.
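A guarded sketch of hyperparameter tuning with tune() from e1071, searching a small cost grid for an SVM; the grid values are arbitrary illustration choices:

```r
# Hedged sketch: tune() from e1071 performs a cross-validated grid
# search; the cost grid below is an arbitrary illustration.
tuned <- if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(1)
  e1071::tune(e1071::svm, Species ~ ., data = iris,
              ranges = list(cost = c(0.1, 1, 10)))
} else NULL
if (!is.null(tuned)) print(tuned$best.parameters)
```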
*Elements of Statistical Learning : Data sets, functions and examples from the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman have been packaged and are available as ElemStatLearn.
neige
I am looking for the R code for the book
Trevor Hastie, Robert Tibshirani, Jerome Friedman. The elements of statistical learning. Springer, 2001
Does anyone have it?