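[This title does not yet capture the difficulty I ran into; once the problem is solved I may look back and come up with a better one.]
Hi everyone. For some exploratory data analysis I recently wrote a function that summarizes the distribution of a numeric variable by bins. Given a continuous numeric variable and a set of break points, it outputs the number of observations that fall into each bin and the proportion they represent.
The simple case
In the simple case, all that is needed is the data and the break points. This is how I wrote the function: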
library(tidyverse)
library(magrittr)
# For my research I use a logarithmic binning scheme
log10_breaks=c(-Inf, -100, -10, -1, -0.1, 0, 0.1, 1, 10, 100, Inf)
# The function itself is really just one statement
# Get the value distribution by ranges of a continuous variable.
# x is a numeric vector; it is the variable of interest.
# breaks is a numeric vector giving the thresholds of each range.
# The output is a tibble with three columns: range, number of observations, and proportion of observations.
get_value_distribution=function(x, breaks=log10_breaks)
{
  cut(x, breaks) %>%
    table() %>%
    as_tibble() %>%
    rename(., range=`.`, freq='n') %>%
    mutate(prop=freq/sum(freq))
}
This function works fine, and its result is exactly what I expect. For example:
get_value_distribution(mtcars$disp)
This produces the output
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0 0
 2 (-100,-10]      0 0
 3 (-10,-1]        0 0
 4 (-1,-0.1]       0 0
 5 (-0.1,0]        0 0
 6 (0,0.1]         0 0
 7 (0.1,1]         0 0
 8 (1,10]          0 0
 9 (10,100]        5 0.156
10 (100, Inf]     27 0.844
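As a side note on how the bins are formed: with its default settings, cut() builds intervals that are open on the left and closed on the right, which is why the labels above read (a,b]. The small base-R sketch below illustrates this with a few values sitting exactly on a break point; the vector probe is made up purely for illustration.

```r
# Same break points as log10_breaks above
breaks <- c(-Inf, -100, -10, -1, -0.1, 0, 0.1, 1, 10, 100, Inf)

# Hypothetical probe values, some of them exactly on a break point
probe <- c(-0.1, 0, 0.05, 100, 100.1)

# as.integer() on the factor gives the index of the bin each value falls into
bin_index <- as.integer(cut(probe, breaks))
bin_index
```

A value equal to a break point lands in the bin that ends at that break point, so 100 is counted in (10,100] rather than in the last bin.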
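The grouped case
Often I care about how a discrete variable affects the distribution of a continuous numeric variable. So, building on the function above, I wrote a more elaborate one that first splits the continuous variable by the values of a discrete variable, and then outputs the distribution of the continuous variable for each value the discrete variable takes.
This is how I wrote the function: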
# Get the value distribution by ranges of a continuous variable, sliced by another discrete variable.
# dataset is a data frame or tibble containing both the continuous variable to be sliced and the discrete variable serving as the slicer.
# value is the variable in the dataset to be sliced.
# class is the variable in the dataset serving as the slicer.
# breaks is a numeric vector giving the thresholds of each range.
# The output is a list of tibbles, each containing three columns: range, number of observations, and proportion of observations.
# The number of tibbles equals the number of unique values that appear in the slicer variable.
get_distributions=function(dataset, value, class='none', breaks=log10_breaks)
{
  # Convert incoming parameters 'value' and 'class' into strings
  value=deparse(substitute(value))
  class=deparse(substitute(class))
  # If a 'class' variable is not specified
  if (class=='none')
  {
    get_value_distribution(dataset[[value]], breaks)
  }
  else
  {
    # List all categories in the 'class' variable
    unique(dataset[[class]]) -> classes
    # Slice the data set into several pieces according to the classes (or 'labels')
    lapply(classes, function(x) dataset[dataset[[class]]==x, ]) %>%
      # Only keep the variable of interest
      lapply(., function(x) x[[value]]) %>%
      # Call get_value_distribution() to process
      lapply(., get_value_distribution, breaks) -> results
    # Attach a name to each table generated
    names(results)=classes
    # Output
    results
  }
}
This function also runs correctly when I do specify a discrete variable. For example:
get_distributions(mtcars, disp, gear)
This produces the output
$`4`
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0 0
 2 (-100,-10]      0 0
 3 (-10,-1]        0 0
 4 (-1,-0.1]       0 0
 5 (-0.1,0]        0 0
 6 (0,0.1]         0 0
 7 (0.1,1]         0 0
 8 (1,10]          0 0
 9 (10,100]        4 0.333
10 (100, Inf]      8 0.667

$`3`
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0     0
 2 (-100,-10]      0     0
 3 (-10,-1]        0     0
 4 (-1,-0.1]       0     0
 5 (-0.1,0]        0     0
 6 (0,0.1]         0     0
 7 (0.1,1]         0     0
 8 (1,10]          0     0
 9 (10,100]        0     0
10 (100, Inf]     15     1

$`5`
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0   0
 2 (-100,-10]      0   0
 3 (-10,-1]        0   0
 4 (-1,-0.1]       0   0
 5 (-0.1,0]        0   0
 6 (0,0.1]         0   0
 7 (0.1,1]         0   0
 8 (1,10]          0   0
 9 (10,100]        1   0.2
10 (100, Inf]      4   0.8
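For comparison, the same slice-then-bin idea can be sketched in base R with split(), which groups one vector by the values of another. This is only an illustrative alternative under the same break points, not a replacement for the function above; the names groups, counts and props are introduced here for the example. Note that split() orders the groups by sorted value (3, 4, 5), whereas unique() keeps the order of first appearance (4, 3, 5).

```r
# Same break points as log10_breaks above
breaks <- c(-Inf, -100, -10, -1, -0.1, 0, 0.1, 1, 10, 100, Inf)

# Group the continuous variable by the discrete one: a named list of numeric vectors
groups <- split(mtcars$disp, mtcars$gear)

# Count the observations per bin within each group
counts <- lapply(groups, function(v) table(cut(v, breaks)))

# Turn the counts into proportions within each group
props <- lapply(counts, prop.table)
```

The counts for each gear value should agree with the freq columns shown above.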
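My question
When I call get_distributions() without specifying a discrete variable, I intend it to return essentially the same result as get_value_distribution(). That is why I set the default value of the class parameter in get_distributions() to 'none'.
But when I try to use it that way: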
get_distributions(mtcars, disp)
the result I get is
list()
How should I change the code to get the expected result?
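To help narrow things down, here is a minimal sketch that isolates the conversion step. As far as I can tell, when class is left at its default, substitute() returns the default expression (the string constant 'none'), and deparse() then renders it with its quotation marks included, so the comparison with 'none' is FALSE and the else branch runs with a column name that does not exist. The helpers show_class() and show_class_fixed() are hypothetical and exist only for this illustration; checking missing() before deparsing is one possible workaround, not necessarily the best one.

```r
# Hypothetical helper: mimics how get_distributions() turns `class` into a string
show_class <- function(class = 'none') {
  deparse(substitute(class))
}

show_class(gear)   # the symbol is deparsed to "gear"
show_class()       # the default is deparsed with its quotes: "\"none\""

# One possible workaround (an assumption, not the only option):
# test whether the argument was supplied before deparsing it
show_class_fixed <- function(class) {
  if (missing(class)) 'none' else deparse(substitute(class))
}

show_class_fixed()       # "none"
show_class_fixed(gear)   # "gear"
```

If this diagnosis is right, the same missing(class) check would have to happen in get_distributions() before class is reassigned, because missing() is only reliable while the argument is still untouched.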