Term frequency matrix

- Last updated: 2016-02-24
Question:

I have a string like this:

m<-"abcdabcdbcadacbddabcc..."

I would like to generate a matrix like this:

[Image: the desired matrix, one row per three-letter term]

How can I do that in R?

Comment: How do you know which substrings you want to count?

OP: The substrings are aaa, aab, aac, ...

Answer:

This gives what I believe you're after:

m <- "abcdabcdbcadacbddabcc"

library(qdap)

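# Distinct characters in m
chars <- unique(unlist(strsplit(m, "")))
# Every three-character combination of those characters
terms <- paste2(expand.grid(rep(list(chars), 3)), sep="")
# Count each term in m and show the counts as a single column
t(counts(termco(m, match.list=sort(terms)))[, -c(1:2)])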

Output:

    1
aaa 0
aab 0
aac 0
aad 0
aba 0
.
.
.
dcc 0
dcd 0
dda 1
ddb 0
ddc 0
ddd 0
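
For readers without qdap, the same kind of table can be sketched in base R by sliding a three-character window over the string and tabulating what it sees. This is only a sketch under stated assumptions, not the answerer's method: the names k, starts, windows and term_matrix are made up for illustration, and the term length of 3 is taken from the aaa, aab, ... terms mentioned in the question. It counts overlapping occurrences.

```r
m <- "abcdabcdbcadacbddabcc"
k <- 3  # term length (assumed from the aaa, aab, ... terms)

# All overlapping substrings of length k: "abc", "bcd", "cda", ...
starts <- seq_len(nchar(m) - k + 1)
windows <- substring(m, starts, starts + k - 1)

# Every possible term over the characters that occur in m, sorted
chars <- sort(unique(strsplit(m, "")[[1]]))
grid <- expand.grid(rep(list(chars), k), stringsAsFactors = FALSE)
terms <- sort(do.call(paste0, grid))

# Tabulate with fixed levels so that absent terms are reported as 0
freq <- table(factor(windows, levels = terms))
term_matrix <- matrix(as.integer(freq), ncol = 1,
                      dimnames = list(terms, "count"))
```

For the sample string, term_matrix["abc", "count"] is 3 and term_matrix["dda", "count"] is 1, which agrees with the dda row in the output above.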

Answer:

The function gregexpr gives you the position of each match of the pattern.

You can do this:

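a <- c("a","b","c")
# All two-letter, then all three-letter, combinations of the letters in a
b <- matrix(outer(a,a,paste,sep=""),ncol=1)
patterns <- matrix(outer(a,b,paste,sep=""),ncol=1)

m<-"abcdabcdbcadacbddabcc..."

# Position of the first match of pattern in text (-1 if there is none)
positions <- function(pattern, text)
  gregexpr(pattern, text)[[1]][1]

sapply(patterns, positions, text=m)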

OP: Thank you. Is it possible to find the count of each term?
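
On the count question, one possible approach is to keep every match position that gregexpr returns and count them, instead of taking only the first. The following is a hedged sketch, not something posted in the thread: count_matches and counts are hypothetical names, "d" is added to the letters on the assumption that all four characters of the string are wanted, and the sample string is used without its trailing "...".

```r
m <- "abcdabcdbcadacbddabcc"

a <- c("a", "b", "c", "d")
# All three-letter combinations of the letters in a
patterns <- as.vector(outer(outer(a, a, paste0), a, paste0))

# Number of matches of a literal pattern in text.
# gregexpr returns -1 when there is no match, so only positive
# positions are counted.
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
  sum(hits > 0)
}

counts <- sapply(patterns, count_matches, text = m)

# Show only the terms that actually occur
counts[counts > 0]
```

Note that gregexpr finds non-overlapping matches, so a term such as aaa would be counted once in "aaaa". If overlapping occurrences matter, a lookahead pattern such as paste0("(?=", pattern, ")") with perl = TRUE is a commonly used workaround.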