Do I need to preprocess the file with awk, or can this be done directly in R?

I used to process CSV files with awk. Here is my first script:

tail -n +2 shifted_final.csv | awk -F, 'NR == 1 || $2 != old {print} {old = $2}' | less

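This script looks for repeated values in the second column (that is, the value on line n is the same as on lines n+1, n+2, …) and prints only the first occurrence of each run. For example, given the following input: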

ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

the output will be:

1,0,0,1.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

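Edit:
I am adding a second script, which was a bit more of a challenge for me.

The second script does the same thing but prints the last occurrence of each run: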

tail -n +2 shifted_final.csv | awk -F, 'NR > 1 && $2 != old {print line} {old = $2; line = $0} END {if (NR > 0) print line}' | less

Its output will be:

22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0

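I think R is a powerful language that should be able to handle tasks like this, but I have only found questions about calling awk scripts from R and the like. How can I do this in R?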

Best answer
Regarding the update to your question, here is a more general solution, thanks to @nicola:

Idx.first <- c(TRUE, tbl$orig[-1] != tbl$orig[-nrow(tbl)])
##
R> tbl[Idx.first,]
#    ord orig pred as o.p
# 1    1    0    0  1   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

If you want the last occurrence in each run rather than the first, just append TRUE to @nicola's indexing expression instead of prepending it:

Idx.last <- c(tbl$orig[-1] != tbl$orig[-nrow(tbl)], TRUE)
##
R> tbl[Idx.last,]
#    ord orig pred as o.p
# 22  22    0    0  0   0
# 23  23    4    0  0   4
# 24  24  402    0  1 402
# 25  25    0    0  1   0

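In either case, tbl$orig[-1] != tbl$orig[-nrow(tbl)] compares the 2nd through nth values of column 2 with the 1st through (n-1)th values of the same column. The result is a logical vector in which each TRUE marks a change between consecutive values. Because the comparison has length n-1, prepending an extra TRUE (case 1) selects the first occurrence in each run, whereas appending an extra TRUE (case 2) selects the last occurrence in each run.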
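To make the index arithmetic concrete, here is a minimal sketch on a made-up five-element vector. The vector `x` and the names `change`, `is_first`, `is_last`, `run_start` and `run_end` are illustrative and not part of the original answer. It also shows base R's `rle()` as one alternative way to locate run boundaries.

```r
# Toy vector with four runs: 0 0 | 4 | 402 | 0 (illustrative data only)
x <- c(0, 0, 4, 402, 0)

# Compare elements 2..n with elements 1..(n-1); TRUE marks a change of value
change <- x[-1] != x[-length(x)]

is_first <- c(TRUE, change)  # TRUE at the first element of each run
is_last  <- c(change, TRUE)  # TRUE at the last element of each run

which(is_first)  # 1 3 4 5
which(is_last)   # 2 3 4 5

# Alternative: run-length encoding gives the same boundaries
r <- rle(x)
run_end   <- cumsum(r$lengths)        # 2 3 4 5
run_start <- run_end - r$lengths + 1  # 1 3 4 5
```

Both approaches assume the column contains no NA values; with NA present, the `!=` comparison yields NA and the logical index would need extra handling.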

Data:

tbl <- read.table(text = "ord,orig,pred,as,o-p
1,0,0,1.0,0
2,0,0,1.0,0
3,0,0,1.0,0
4,0,0,0.0,0
5,0,0,0.0,0
6,0,0,0.0,0
7,0,0,0.0,0
8,0,0,0.0,0
9,0,0,0.0,0
10,0,0,0.0,0
11,0,0,0.0,0
12,0,0,0.0,0
13,0,0,0.0,0
14,0,0,0.0,0
15,0,0,0.0,0
16,0,0,0.0,0
17,0,0,0.0,0
18,0,0,0.0,0
19,0,0,0.0,0
20,0,0,0.0,0
21,0,0,0.0,0
22,0,0,0.0,0
23,4,0,0.0,4
24,402,0,1.0,402
25,0,0,1.0,0",
header = TRUE,
sep = ",")
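If the data lives in a file rather than an inline string, the same indexing should work after `read.csv()`, so no awk preprocessing step is needed. The sketch below is self-contained: it writes a few rows to a temporary file that stands in for `shifted_final.csv`. The helper name `run_boundaries` and its `side` argument are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: keep the first or last row of each run of equal
# values in column `col`. Assumes at least one row and no NA in that column.
run_boundaries <- function(df, col, side = c("first", "last")) {
  side <- match.arg(side)
  v <- df[[col]]
  change <- v[-1] != v[-length(v)]
  keep <- if (side == "first") c(TRUE, change) else c(change, TRUE)
  df[keep, , drop = FALSE]
}

# Temporary stand-in for shifted_final.csv (shortened sample data)
path <- tempfile(fileext = ".csv")
writeLines(c("ord,orig,pred,as,o-p",
             "1,0,0,1.0,0",
             "2,0,0,0.0,0",
             "3,4,0,0.0,4",
             "4,402,0,1.0,402",
             "5,0,0,1.0,0"), path)

tbl_file <- read.csv(path)  # the header o-p becomes o.p (check.names = TRUE)

first_rows <- run_boundaries(tbl_file, "orig", "first")
last_rows  <- run_boundaries(tbl_file, "orig", "last")

first_rows$ord  # 1 3 4 5
last_rows$ord   # 2 3 4 5
```

Unlike the awk pipeline, there is no need for `tail -n +2` here, because `read.csv()` consumes the header line itself.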
