Problem with a homemade binned-count function

Heterogeneity · March 23, 2023

[This title does not describe my problem very well yet. Once the problem is solved I may be able to come back and give it a better one.]

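Hi all. For exploratory data analysis I recently wrote a function that tabulates the distribution of a numeric variable by bins. Given a continuous variable and a set of break points, it returns the number of observations in each bin and each bin's share of the total.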

The simple case

In the simple case, all that is needed is the data and the break points. This is how I wrote the function:

library(tidyverse)
library(magrittr)

# For my research I use a logarithmic binning scheme
log10_breaks=c(-Inf, -100, -10, -1, -0.1, 0, 0.1, 1, 10, 100, Inf)

# The function itself, which is really just one statement

# Get the value distribution by ranges of a continuous variable.
# x is a vector of numbers and it is the variable of interest.
# breaks is a vector of numbers giving the boundaries of the ranges.
# The output is a tibble of three columns: range, number of observations, and the proportions of observations.
get_value_distribution=function(x, breaks=log10_breaks)
{
  cut(x, breaks) %>% table() %>% as_tibble() %>% rename(., range=`.`, freq='n') %>% mutate(prop=freq/sum(freq))
}

This function works fine, and its result is exactly what I expect. For example:

get_value_distribution(mtcars$disp)

This produces the output

# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0 0    
 2 (-100,-10]      0 0    
 3 (-10,-1]        0 0    
 4 (-1,-0.1]       0 0    
 5 (-0.1,0]        0 0    
 6 (0,0.1]         0 0    
 7 (0.1,1]         0 0    
 8 (1,10]          0 0    
 9 (10,100]        5 0.156
10 (100, Inf]     27 0.844
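
To make the binning step easier to follow, here is a minimal base-R sketch of what cut() and table() do inside the function. The vector x and the shortened demo_breaks are made up for illustration and are not part of the original post.

```r
# Made-up sample values and a shortened set of break points (illustration only)
x <- c(-5, 0, 0.05, 3, 250)
demo_breaks <- c(-Inf, -1, 0, 1, Inf)

# cut() assigns each value to an interval; by default intervals are
# right-closed, so the value 0 lands in the second bin, not the third
bins <- cut(x, demo_breaks)

# table() counts the observations per interval, including empty ones
counts <- table(bins)

# Dividing by the total gives each bin's share
props <- counts / sum(counts)

print(counts)
print(props)
```

Because the intervals are right-closed, a value that sits exactly on a break point is counted in the lower bin. cut() has a right argument that switches this if the opposite convention is needed.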

The grouped case

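Often I want to know how a discrete variable affects the distribution of a continuous variable. So I built a more elaborate function on top of the one above: it first splits the continuous variable by the discrete variable, then reports the binned distribution of the continuous variable for each value the discrete variable takes.
Here is the function: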

# Get the value distribution by ranges of a continuous variable, sliced by another discrete variable.
# dataset is a data frame or tibble, containing both the continuous variable to be sliced, and the discrete variable serving as the slicer.
# value is the variable in the dataset to be sliced.
# class is the variable in the dataset serving as the slicer.
# breaks is a vector of numbers giving the boundaries of the ranges.
# The output is a list of several tibbles, with each tibble containing three columns: range, number of observations, and the proportions of observations.
# The number of tibbles equals the number of unique values in the slicer variable.
get_distributions=function(dataset, value, class='none', breaks=log10_breaks)
{
  # Convert incoming parameters 'value' and 'class' into strings
  value=deparse(substitute(value))
  class=deparse(substitute(class))
  
  # If a 'class' variable is not specified
  if (class=='none')
  {
    get_value_distribution(dataset[[value]], breaks)
  }
  else
  {
    # List all categories in the 'class' variable
    unique(dataset[[class]]) -> classes
      
    # Slice the data set into several pieces according to the classes (or 'labels')
    lapply(classes, function(x) dataset[dataset[[class]]==x, ]) %>%
      
    # Only keep the variable of interest
    lapply(., function(x) x[[value]]) %>%
      
    # Call get_value_distribution() to process
    lapply(., get_value_distribution, breaks) -> results
    
    # Attach a name to each table generated
    names(results)=classes
    
    # Output
    results
  }
}

This function also runs correctly when I specify a discrete variable. For example:

get_distributions(mtcars, disp, gear)

This produces the output

$`4`
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0 0    
 2 (-100,-10]      0 0    
 3 (-10,-1]        0 0    
 4 (-1,-0.1]       0 0    
 5 (-0.1,0]        0 0    
 6 (0,0.1]         0 0    
 7 (0.1,1]         0 0    
 8 (1,10]          0 0    
 9 (10,100]        4 0.333
10 (100, Inf]      8 0.667

$`3`
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0     0
 2 (-100,-10]      0     0
 3 (-10,-1]        0     0
 4 (-1,-0.1]       0     0
 5 (-0.1,0]        0     0
 6 (0,0.1]         0     0
 7 (0.1,1]         0     0
 8 (1,10]          0     0
 9 (10,100]        0     0
10 (100, Inf]     15     1

$`5`
# A tibble: 10 × 3
   range        freq  prop
   <chr>       <int> <dbl>
 1 (-Inf,-100]     0   0  
 2 (-100,-10]      0   0  
 3 (-10,-1]        0   0  
 4 (-1,-0.1]       0   0  
 5 (-0.1,0]        0   0  
 6 (0,0.1]         0   0  
 7 (0.1,1]         0   0  
 8 (1,10]          0   0  
 9 (10,100]        1   0.2
10 (100, Inf]      4   0.8

My question

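When I call get_distributions() without specifying a discrete variable, I intend it to return essentially the same result as get_value_distribution(). That is why I set the default value of the class argument in get_distributions() to 'none'.
But when I call it like this: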

get_distributions(mtcars, disp)

the result I get is

list()

How should I change the code to get the result I expect?

fenguoerbian · March 23, 2023

class=deparse(substitute(class))

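Your design lets the user pass class as a bare variable name without quotes, like gear in your example, but the default value of class is the string 'none'. When the line above runs on that default, what class actually holds is the string "\"none\"". That is not equal to 'none', so the if test fails and the else branch runs on a column that does not exist.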
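
A minimal sketch of this behavior, for anyone who wants to see it in isolation. The function show_class() is a hypothetical helper that only mirrors the argument handling of get_distributions(); it is not from the original code.

```r
# Hypothetical helper: same default and same conversion as get_distributions()
show_class <- function(class = 'none') {
  deparse(substitute(class))
}

with_arg <- show_class(gear)   # the bare name becomes the string "gear"
without_arg <- show_class()    # the default is deparsed together with its quotes

print(with_arg)
print(without_arg)
print(without_arg == 'none')   # FALSE, so the 'none' branch is never taken

# What the else branch then does with that string:
column <- mtcars[[without_arg]]                   # no such column, so NULL
pieces <- lapply(unique(column), function(x) x)   # lapply over NULL gives list()
print(pieces)
```

The empty list at the end is the list() seen in the question.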

My suggestion is to give class no default value and have the function branch on missing(class) first. If the user did supply class, then within that branch call

class=deparse(substitute(class))

yydhcl · March 23, 2023

Try class = none in the argument list.
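
A small sketch of why an unquoted default could behave differently, assuming the rest of the function stays as in the question. The function show_class_symbol() is a hypothetical stand-in, and the sketch relies on the default never being evaluated.

```r
# Hypothetical stand-in: the default is now the bare symbol none, not a string
show_class_symbol <- function(class = none) {
  # substitute() returns the unevaluated symbol, so no object called
  # 'none' has to exist as long as class is never evaluated
  deparse(substitute(class))
}

default_name <- show_class_symbol()
given_name <- show_class_symbol(gear)

print(default_name)            # deparses to the plain string "none"
print(default_name == 'none')  # TRUE, so the 'none' branch would be taken
print(given_name)
```

This depends on lazy evaluation: any code that evaluates class directly would look for an object named none, so the missing() approach is arguably the more robust of the two.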

Heterogeneity · March 23, 2023

fenguoerbian It really does take an outside observer to see clearly!

Heterogeneity · March 23, 2023

In the end, following fenguoerbian's advice, I rewrote get_distributions() as

get_distributions=function(dataset, value, class, breaks=log10_breaks)
{
  # Convert incoming parameter 'value' into strings
  value=deparse(substitute(value))

  # If a 'class' variable is not specified
  if (missing(class))
  {
    get_value_distribution(dataset[[value]], breaks)
  }
  else
  {
    # Convert incoming parameter 'class' into strings
    class=deparse(substitute(class))
    
    # List all categories in the 'class' variable
    unique(dataset[[class]]) -> classes
      
    # Slice the data set into several pieces according to the classes (or 'labels')
    lapply(classes, function(x) dataset[dataset[[class]]==x, ]) %>%
      
    # Only keep the variable of interest
    lapply(., function(x) x[[value]]) %>%
      
    # Call get_value_distribution() to process
    lapply(., get_value_distribution, breaks) -> results
    
    # Attach a name to each table generated
    names(results)=classes
    
    # Output
    results
  }
}

I tested it, and it produces the output I originally expected.
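
For readers who want to check both call forms without loading tidyverse, here is a condensed base-R sketch of the same missing() pattern. The names count_bins() and distributions_sketch() are hypothetical, the break points are shortened for illustration, and split() is swapped in for the chain of lapply() calls, so this is a variant and not the function above.

```r
# Base-R stand-in for get_value_distribution() (hypothetical name)
count_bins <- function(x, breaks) {
  bins <- cut(x, breaks)
  freq <- as.vector(table(bins))
  data.frame(range = levels(bins), freq = freq, prop = freq / sum(freq))
}

# Condensed variant of the corrected function (hypothetical name)
distributions_sketch <- function(dataset, value, class, breaks) {
  value <- deparse(substitute(value))

  # No slicer supplied: fall back to the overall distribution
  if (missing(class)) {
    return(count_bins(dataset[[value]], breaks))
  }

  # Slicer supplied: convert it to a string only in this branch
  class <- deparse(substitute(class))

  # split() groups the values by the slicer and names the pieces
  lapply(split(dataset[[value]], dataset[[class]]), count_bins, breaks = breaks)
}

demo_breaks <- c(-Inf, 100, Inf)
overall <- distributions_sketch(mtcars, disp, breaks = demo_breaks)
by_gear <- distributions_sketch(mtcars, disp, gear, breaks = demo_breaks)

print(overall)
print(names(by_gear))
```

One difference to be aware of: split() orders the groups by sorted value (3, 4, 5), whereas unique() in the function above keeps the order of first appearance (4, 3, 5).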