How to extract the year and month separately from dates stored as strings

Cloud2016

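As the title says. Part of the dataset is shown below; the first column, 发震时刻 (event time), holds the date strings. The only way I can think of to split them is a regular expression. Is there a more convenient approach?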

> head(eq)
             发震时刻 震级(M) 纬度(°) 经度(°) 深度(千米)         参考位置
1 2019-06-30 21:44:44     3.0   27.56  112.10          6 湖南娄底市双峰县
2 2019-06-30 21:32:29     3.1   28.44  104.81          8 四川宜宾市长宁县
3 2019-06-30 12:14:25     3.0   28.43  104.77          9   四川宜宾市珙县
4 2019-06-30 03:44:11     4.8   22.43  122.31         30   台湾台东县海域
5 2019-06-30 03:09:29     3.1   31.00   98.96          7 四川甘孜州白玉县
6 2019-06-30 03:07:58     3.1   31.02   98.96          8 四川甘孜州白玉县
> str(eq)
'data.frame':   6917 obs. of  6 variables:
 $ 发震时刻  : chr  "2019-06-30 21:44:44" "2019-06-30 21:32:29" "2019-06-30 12:"..
 $ 震级(M)   : num  3 3.1 3 4.8 3.1 3.1 3.1 3 3.4 4 ...
 $ 纬度(°)   : num  27.6 28.4 28.4 22.4 31 ...
 $ 经度(°)   : num  112 105 105 122 99 ...
 $ 深度(千米): num  6 8 9 30 7 8 22 6 8 6 ...
 $ 参考位置  : chr  "湖南娄底市双峰县" "四川宜宾市长宁县" "四川宜宾市珙县" "台湾台东县海域" ...

dapengde

Tidyverse version:

> lubridate::year('2019-06-30 21:44:44')
[1] 2019
> lubridate::month('2019-06-30 21:44:44')
[1] 6

Non-Tidyverse version:

x <- strptime('2019-06-30 21:44:44', format = '%Y-%m-%d %H:%M:%S')
format(x, '%Y')
format(x, '%m')
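
To apply the same idea to a whole column rather than a single string, something like the following sketch should work. The data frame `eq_demo` and its `time` column are made-up stand-ins for `eq` and its first column, and `tz = "UTC"` is an assumption chosen only to sidestep daylight-saving gaps:

```r
# Made-up stand-in for eq; the second value has no zero padding.
eq_demo <- data.frame(
  time = c("2019-06-30 21:44:44", "2019-6-3 21:44:44"),
  stringsAsFactors = FALSE
)

# Parse the strings once, then format out the pieces as integers.
parsed <- as.POSIXct(eq_demo$time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
eq_demo$year  <- as.integer(format(parsed, "%Y"))
eq_demo$month <- as.integer(format(parsed, "%m"))
eq_demo
```

Strings that fail to parse come back as `NA` instead of raising an error, so `anyNA(parsed)` is a cheap sanity check.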

Cloud2016

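I had assumed that the named-capture feature of Perl-style regular expressions would return each name together with its value. I can't tell whether I'm using it wrong or the feature just isn't there.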

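dapengde your non-Non-Tidyverse version is really impressive. At the risk of embarrassing myself, here is the regex version. Its only advantage is that, when the date format is less tidy, you can add rules to adapt. Real data tends to be dirty, though this dataset is fairly clean. Apart from cases like 2019-6-3 21:44:44 I haven't seen anything worse so far, and strptime still handles that case well, so here's another 🚀 for you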

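extract_date <- function(x) {
  # The groups are nested, so regmatches() returns 8 elements per string:
  # 1 = whole match, 2 = date, 3/5/7 = named groups, 4/6/8 = their inner groups.
  m <- regexec("((?<year>(\\d{4}))-(?<month>(\\d{1,2}))-(?<day>(\\d{1,2})))", x, perl = TRUE)
  parts <- do.call(rbind,
                   lapply(regmatches(x, m), `[`, c(2L, 4L, 6L, 8L)))
  colnames(parts) <- c("date", "year", "month", "day")
  parts
}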

equake <- cbind(eq, extract_date(eq[, 1]))
head(equake)

             发震时刻 震级(M) 纬度(°) 经度(°) 深度(千米)         参考位置
1 2019-06-30 21:44:44     3.0   27.56  112.10          6 湖南娄底市双峰县
2 2019-06-30 21:32:29     3.1   28.44  104.81          8 四川宜宾市长宁县
3 2019-06-30 12:14:25     3.0   28.43  104.77          9   四川宜宾市珙县
4 2019-06-30 03:44:11     4.8   22.43  122.31         30   台湾台东县海域
5 2019-06-30 03:09:29     3.1   31.00   98.96          7 四川甘孜州白玉县
6 2019-06-30 03:07:58     3.1   31.02   98.96          8 四川甘孜州白玉县
        date year month day
1 2019-06-30 2019    06  30
2 2019-06-30 2019    06  30
3 2019-06-30 2019    06  30
4 2019-06-30 2019    06  30
5 2019-06-30 2019    06  30
6 2019-06-30 2019    06  30

xieshichen

data.table version

library(data.table)
setDT(eq)[, `:=`(y = year(发震时刻), m=month(发震时刻))]
eq

dapengde

Cloud2016 that maneuver of yours is real mental gymnastics... Regex is scary; reading it is even harder than writing it...

Liechi

Cloud2016
"non-Non-Tidyverse"?

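That is about as roundabout as the famous "It has not escaped our notice ..." line in Watson and Crick's DNA double helix paper. My guess, though, is that you just typed one word too many :)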

Jiena

Cloud2016

You could try the dtverse package. Besides splitting dates into year, month, and day, it can also split first and last names

library(dtverse)
data("dt_dates")
dt_dates$Start_Date <- as.character(dt_dates$Start_Date)
dtverse::str_split_col(dt_dates,
                       by_col = "Start_Date",
                       by_pattern = "-",
                       match_to_names = c("Year", "Month", "Day"))
 
   Start_Date   End_Date      Full_name First Name Last Name Year Month Day
1: 2019-05-01 2019-06-01     Joe, Smith        Joe     Smith 2019    05  01
2: 2019-08-04 2019-08-09 Alex, Robinson       Alex  Robinson 2019    08  04
3: 2019-07-05 2019-08-14     David, Big      David       Big 2019    07  05
4: 2019-07-04 2019-07-05     Julia, Joe      Julia       Joe 2019    07  04
5: 2019-04-27 2019-05-10  Jessa, Oliver      Jessa    Oliver 2019    04  27

Cloud2016

Liechi I did indeed type one word too many
dapengde regex really is hard to read

dapengde

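Liechi It's the English-flavored Chinese idiom in which a double negative expresses a negative. That must be it. It doesn't affect the reader's understanding at all.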

Liechi

dapengde The only place I've come across that flashy move, a double negative expressing a negative, is in songs: I ain't got no money.

tctcab

Cloud2016

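Regex isn't particularly well suited to date-like strings; the long regex you wrote only works for dates in YYYY-mm-dd format. lubridate::as_date() is considerably better than base as.Date() at recognizing date formats. For example: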

datestr = c("1970-01-01","19700101","70-01-01")


lubridate::as_date(datestr)
#> [1] "1970-01-01" "1970-01-01" "1970-01-01"

as.Date(datestr)
#> [1] "1970-01-01" NA           "70-01-01"

So I recommend lubridate...
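
For anyone who needs to stay in base R, a per-element fallback over a list of candidate formats can be sketched as follows. `parse_mixed` and its default format list are hypothetical, not part of any package, and the order of the formats is an assumption that matters: `%Y` also accepts a two-digit year, so `%y-%m-%d` is tried first.

```r
# Hypothetical helper: try each format in turn on the values still unparsed.
parse_mixed <- function(x, formats = c("%y-%m-%d", "%Y-%m-%d", "%Y%m%d")) {
  out <- rep(as.Date(NA), length(x))
  for (f in formats) {
    todo <- is.na(out)                      # only retry what failed so far
    out[todo] <- as.Date(x[todo], format = f)
  }
  out
}

datestr <- c("1970-01-01", "19700101", "70-01-01")
parsed_dates <- parse_mixed(datestr)
parsed_dates
```

This is far less forgiving than lubridate's format guessing; it only covers the formats that are listed explicitly.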

Cloud2016

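tctcab That long regex came from my being too optimistic at the start. I thought named capture could give this kind of result directly, i.e. a return value that is a data frame or a list like this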

# data frame
year  month day
1970 01 01
# list
year
1970

month
01

day
01
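
A shape close to this can be assembled in base R, because `regexpr()` with `perl = TRUE` attaches `capture.start`, `capture.length`, and `capture.names` attributes to its result. The sketch below assumes every group in the pattern is named; `extract_named` is a hypothetical helper written for illustration:

```r
# Hypothetical helper: turn named capture groups into data frame columns.
extract_named <- function(x, pattern) {
  m      <- regexpr(pattern, x, perl = TRUE)
  hit    <- as.vector(m) != -1L            # FALSE where nothing matched
  starts <- attr(m, "capture.start")
  lens   <- attr(m, "capture.length")
  nms    <- attr(m, "capture.names")
  cols <- lapply(seq_along(nms), function(j) {
    val <- substring(x, starts[, j], starts[, j] + lens[, j] - 1L)
    val[!hit] <- NA_character_             # unmatched strings become NA
    val
  })
  names(cols) <- nms
  as.data.frame(cols, stringsAsFactors = FALSE)
}

pat <- "(?<year>\\d{4})-(?<month>\\d{1,2})-(?<day>\\d{1,2})"
res <- extract_named(c("2019-06-30 21:44:44", "2019-6-3 21:44:44", "n/a"), pat)
res
```

The columns stay character, so a month such as "06" keeps its leading zero; wrap them in `as.integer()` if numbers are wanted.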

lubridate really is good, and everyone's solutions are better than mine!

Going off-topic: I noticed a conflict when loading lubridate

library(lubridate)

Attaching package: ‘lubridate’

The following object is masked from ‘package:base’:

    date

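Looking further, I found that without loading lubridate, the date that ships with Base R has the same content, i.e. lubridate::date() and base::date() do the same thing. That is pure reinvention of the wheel, meant to confuse users so that whenever they use the date function later they just install the lubridate package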

tctcab

Cloud2016
They are not the same

base::date() just returns the current date and takes no arguments

lubridate::date() has an extra argument x; of course, when called without arguments it still calls base::date()

dapengde

Liechi

Sometimes we all do things that, well, just don’t make no sense.
--- Forrest Gump

wangbinzjcc

Regex version:

extract_date <- read.table(    
     text = gsub("^(([0-9]{4})-([0-9]+)-([0-9]+)) .*", "\\1 \\2 \\3 \\4", eq[, 1]),
     col.names = c("date", "year", "month", "day")
 )

equake <- cbind(eq, extract_date)

head(equake)
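
One thing to be aware of with this approach: by default `read.table()` guesses column types, so a month such as "06" comes back as the integer 6. If the zero-padded strings are wanted, passing `colClasses = "character"` should keep them. A small sketch on a made-up vector `times` standing in for `eq[, 1]`:

```r
# Made-up stand-in for eq[, 1].
times <- c("2019-06-30 21:44:44", "2019-6-3 21:44:44")

# Same gsub() rewrite as above, but every column is kept as character.
parts <- read.table(
  text = gsub("^(([0-9]{4})-([0-9]+)-([0-9]+)) .*", "\\1 \\2 \\3 \\4", times),
  col.names = c("date", "year", "month", "day"),
  colClasses = "character"
)
parts
```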