Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


splitstackshape - Splitting text to words with R and cSplit()

I'm trying to split a series of sentences into separate words, that is, to tokenize the text.

I found an R package, splitstackshape, that does almost what I want. The only problem is that the printed output is truncated to the first and last 5 rows.

Anyway, this is what I need to do:

id text
1 Lorem ipsum dolor sit amet
2 consectetur adipiscing elit
3 Donec euismod enim quis 
4 nunc fringilla sodales
5 Etiam tempor ligula vitae 
6 pellentesque dictum
7 Quisque non justo scelerisque 
8 est facilisis congue quis vel
9 Phasellus ex lorem
10 eleifend at magna vel
11 egestas eleifend massa

Output:

id word
1 Lorem
1 ipsum
1 dolor
1 sit
1 amet
2 consectetur
2 adipiscing
...

That is, I need each word in a separate row, alongside the ID of the sentence it belongs to.

I tried cSplit(data, "text", " ", "long"), but the output is truncated.


Update. FYI, here is how to do the reverse


1 Reply


The cSplit function returns a data.table.

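What you are describing is the default print behavior for data.tables: with the default settings, a table with more than 100 rows is printed as just its first and last 5 rows. The data itself is not truncated, only the display. To see this in action, try the following: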

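library(data.table)
# airquality has 153 rows, so both auto-printing and print() abbreviate it
as.data.table(airquality)
print(as.data.table(airquality))

# nrows = Inf lifts the row limit, so every row is printed
print(as.data.table(airquality), nrows = Inf)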

Thus, to get the full table displayed, you can try:

library(splitstackshape)
print(cSplit(data, "text", " ", "long"), nrows = Inf)
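
For a self-contained version of the above, here is a minimal sketch. The `data` object is a hypothetical reconstruction of the first three rows from the question (the real input was not posted), and `words` and `words_df` are just illustrative names. Besides `nrows = Inf`, it shows two other ways to get an untruncated display: setting the `datatable.print.nrows` option, which is where `print()` for data.tables takes its default row limit from, and converting the result to a plain data.frame.

```r
library(data.table)
library(splitstackshape)

# Hypothetical sample input, rebuilt from the first rows of the question's table
data <- data.frame(
  id = 1:3,
  text = c("Lorem ipsum dolor sit amet",
           "consectetur adipiscing elit",
           "Donec euismod enim quis"),
  stringsAsFactors = FALSE
)

# One row per word; the id is repeated for each word of its sentence.
# Note that the split column keeps its original name, "text".
words <- cSplit(data, "text", sep = " ", direction = "long")

# Option 1: lift the row limit for a single print call
print(words, nrows = Inf)

# Option 2: lift the row limit for the rest of the session
options(datatable.print.nrows = Inf)

# Option 3: convert to a plain data.frame, which is printed by base R's
# data.frame method instead of data.table's head/tail display
words_df <- as.data.frame(words)
print(words_df)
```

With only a handful of rows, as in this sample, nothing would be abbreviated anyway. The difference only shows up once the result is longer than the data.table row limit.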
