How to remove duplicate data in R

chaoyuan_101

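Suppose two columns have the same data type, both character. If one row holds a,b and another holds b,a, the rows should count as duplicates, and the same goes for a row whose two values are identical, such as x,x. How can I remove such duplicates? I looked at the example in an earlier thread and tried it myself, but it does not work.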

 index1<-duplicated(a1[,1])
 index2<-duplicated(a1[,2])
 index=index1&index2
 a2=a1[!index,]
 a2
  first last
1     q    e
2     a    b
3     o    e
4     b    a
5     c    h
6     x    x

How can I solve this?

tctcab

chaoyuan_101

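Judging from your description, your definition of "duplicate" is a bit involved, so duplicated or unique cannot be applied directly. In your attempt, duplicated() is run on each column separately, and the first column has no repeated values, so index is FALSE for every row and nothing gets removed. Try this: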

library(data.table)
library(dplyr)

# read data
a= "
id first last
1 q e
2 a b
3 o e
4 b a
5 c h
6 x x
"
tmp <- fread(a)

# build tmp2: drop self-pairs, then drop order-insensitive duplicates
tmp2 <-
  tmp %>%
  # drop rows where both columns hold the same value (e.g. x, x)
  filter(first != last) %>%
  # idx puts the larger value first, so (a, b) and (b, a) both get idx "ba"
  mutate(idx = ifelse(first > last, paste0(first, last), paste0(last, first))) %>%
  # keep only the first row for each idx
  filter(!duplicated(idx))


## output: rows 4 and 6 are removed
tmp2

#########

  id first last idx
1  1     q    e  qe
2  2     a    b  ba
3  3     o    e  oe
4  5     c    h  hc
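
For reference, here is a base R sketch of the same idea with no packages needed. It assumes the data frame is called a1 as in the question, with character columns first and last. The names key, is_dup and is_self are made up for this example. Keeping the two sorted values in separate columns, instead of pasting them into one string, should also avoid accidental matches when values are longer than one character: with paste0, the pairs (b, ab) and (ba, b) would both become "bab".

```r
# Example data copied from the question (the id column is left out;
# this reconstruction of a1 is an assumption)
a1 <- data.frame(
  first = c("q", "a", "o", "b", "c", "x"),
  last  = c("e", "b", "e", "a", "h", "x"),
  stringsAsFactors = FALSE
)

# Order-independent key: the smaller value goes to lo, the larger to hi,
# so (a, b) and (b, a) both become lo = "a", hi = "b"
key <- data.frame(
  lo = pmin(a1$first, a1$last),
  hi = pmax(a1$first, a1$last),
  stringsAsFactors = FALSE
)

# duplicated() on a data frame compares whole rows, i.e. the (lo, hi) pair
is_dup <- duplicated(key)

# Rows whose two values are identical, such as (x, x)
is_self <- a1$first == a1$last

# Keep rows that are neither a repeated pair nor a self-pair
a2 <- a1[!is_dup & !is_self, ]
a2
```

With the example data this keeps rows 1, 2, 3 and 5, the same rows as the dplyr version. If rows like x,x should be kept instead, dropping the is_self condition is enough.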

ryo

chaoyuan_101 See also <http://rpubs.com/englianhu/natural-language-analysis>