Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

regex - Extract specific lines of text in r

I have a .txt file with thousands of lines that holds meta-information about research articles. For each paper it records, among other fields, the publication type (PT), title (TI), DOI (DI), publication year (PY) and abstract (AB). The file covers almost 300 papers. The first two records look like this:

PT J
AU Filieri, Raffaele
   Acikgoz, Fulya
   Ndou, Valentina
   Dwivedi, Yogesh
TI Is TripAdvisor still relevant? The influence of review credibility,
   review usefulness, and ease of use on consumers' continuance intention
SO INTERNATIONAL JOURNAL OF CONTEMPORARY HOSPITALITY MANAGEMENT
DI 10.1108/IJCHM-05-2020-0402
EA NOV 2020
PY 2020
AB Purpose - Recent figures show that users are discontinuing their usage
   of TripAdvisor, the leading user-generated content (UGC) platform in the
   tourism sector. Hence, it is relevant to study the factors that
   influence travelers' continued use of TripAdvisor.
   Design/methodology/approach - The authors have integrated constructs
   from the technology acceptance model, information systems (IS)
   continuance model and electronic word of mouth literature. They used
   PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
   users of TripAdvisor recruited through Prolific.
   Findings - Findings reveal that perceived ease of use, online consumer
   review (OCR) credibility and OCR usefulness have a positive impact on
   customer satisfaction, which ultimately leads to continuance intention
   of UGC platforms. Customer satisfaction mediates the effect of the
   independent variables on continuance intention.
   Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
   can benefit from the findings of this study. Specifically, they should
   improve the ease of use of their platforms by facilitating travelers'
   information searches. Moreover, they should use signals to make credible
   and helpful content stand out from the crowd of reviews.
   Originality/value - This is the first study that adopts the IS
   continuance model in the travel and tourism literature to research the
   factors influencing consumers' continued use of travel-based UGC
   platforms. Moreover, the authors have extended this model by including
   new constructs that are particularly relevant to UGC platforms, such as
   performance heuristics and OCR credibility.
ZR 0
ZA 0
Z8 0
ZS 0
TC 0
ZB 0
Z9 0
SN 0959-6119
EI 1757-1049
UT WOS:000592516500001
ER

PT J
AU Li, Yelin
   Bu, Hui
   Li, Jiahong
   Wu, Junjie
TI The role of text-extracted investor sentiment in Chinese stock price
   prediction with the enhancement of deep learning
SO INTERNATIONAL JOURNAL OF FORECASTING
VL 36
IS 4
BP 1541
EP 1562
DI 10.1016/j.ijforecast.2020.05.001
PD OCT-DEC 2020
PY 2020
AB Whether investor sentiment affects stock prices is an issue of
   long-standing interest for economists. We conduct a comprehensive study
   of the predictability of investor sentiment, which is measured directly
   by extracting expectations from online user-generated content (UGC) on
   the stock message board of Eastmoney.com in the Chinese stock market. We
   consider the influential factors in prediction, including the selections
   of different text classification algorithms, price forecasting models,
   time horizons, and information update schemes. Using comparisons of the
   long short-term memory (LSTM) model, logistic regression, support vector
   machine, and Naive Bayes model, the results show that daily investor
   sentiment contains predictive information only for open prices, while
   the hourly sentiment has two hours of leading predictability for closing
   prices. Investors do update their expectations during trading hours.
   Moreover, our results reveal that advanced models, such as LSTM, can
   provide more predictive power with investor sentiment only if the inputs
   of a model contain predictive information. (C) 2020 International
   Institute of Forecasters. Published by Elsevier B.V. All rights
   reserved.
CT 14th International Conference on Services Systems and Services
   Management (ICSSSM)
CY JUN 16-18, 2017
CL Dongbei Univ Finance & Econ, Sch Management Sci & Engn, Dalian, PEOPLES
   R CHINA
HO Dongbei Univ Finance & Econ, Sch Management Sci & Engn
SP Tsinghua Univ; Chinese Univ Hong Kong; IEEE Syst Man & Cybernet Soc
ZA 0
TC 0
ZB 0
ZS 0
Z8 0
ZR 0
Z9 0
SN 0169-2070
EI 1872-8200
UT WOS:000570797300025
ER

Now I want to extract the abstract of each article and store it in a data frame. The following code gives me only the first abstract:

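f <- readLines("sample.txt")
# Collapse the lines into one string so the pattern can span line breaks
txt <- paste(f, collapse = "\n")
# Extract the first match only; backslashes must be doubled in R strings
pattern <- "AB\\s*(.*?)\\s*ZR"
result <- regmatches(txt, regexec(pattern, txt))
result[[1]][2]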
[1] "Purpose - Recent figures show that users are discontinuing their usage
   of TripAdvisor, the leading user-generated content (UGC) platform in the
   tourism sector. Hence, it is relevant to study the factors that
   influence travelers' continued use of TripAdvisor.
   Design/methodology/approach - The authors have integrated constructs
   from the technology acceptance model, information systems (IS)
   continuance model and electronic word of mouth literature. They used
   PLS-SEM (smartPLS V.3.2.8) to test the hypotheses using data from 297
   users of TripAdvisor recruited through Prolific.
   Findings - Findings reveal that perceived ease of use, online consumer
   review (OCR) credibility and OCR usefulness have a positive impact on
   customer satisfaction, which ultimately leads to continuance intention
   of UGC platforms. Customer satisfaction mediates the effect of the
   independent variables on continuance intention.
   Practical implications - Managers of UGC platforms (i.e. TripAdvisor)
   can benefit from the findings of this study. Specifically, they should
   improve the ease of use of their platforms by facilitating travelers'
   information searches. Moreover, they should use signals to make credible
   and helpful content stand out from the crowd of reviews.
   Originality/value - This is the first study that adopts the IS
   continuance model in the travel and tourism literature to research the
   factors influencing consumers' continued use of travel-based UGC
   platforms. Moreover, the authors have extended this model by including
   new constructs that are particularly relevant to UGC platforms, such as
   performance heuristics and OCR credibility."

The problem is that I want to extract all the abstracts, but the tag that follows the abstract differs between records (ZR in the first record, CT in the second), so this pattern does not work in general. The rule that holds for every abstract is: take the text starting at AB plus every following line that begins with a space. Can anybody help me with this?

question from:https://stackoverflow.com/questions/65599307/extract-specific-lines-of-text-in-r

1 Reply

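You can first group the lines: the group counter increases by one whenever a line does not start with a space, so each tagged line and its indented continuation lines share one group number.

Then you can aggregate f by group, pasting the lines of each group into a single string, and select the abstracts from the aggregated vector: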

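# A new group starts at every line that does not begin with a space
group <- cumsum(!grepl("^ ", f))
# Paste the lines of each group into one string
f2 <- aggregate(f, list(group), function(x) paste(x, collapse = " "))[, 2]

# Keep only the strings that start with the AB tag
f2[grepl("^AB ", f2)]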
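
Since the question asks for a data frame, here is a minimal sketch of the same grouping idea taken one step further. The two shortened records, the column name abstract and the object names fields and abstracts_df are made up for illustration. It also trims the indentation of continuation lines and drops the AB tag, which the snippet above leaves in place.

```r
# Hypothetical sample: two shortened records in the same tagged format
f <- c(
  "PT J",
  "TI First title",
  "DI 10.1000/a",
  "AB First abstract line",
  "   continues here.",
  "ZR 0",
  "ER",
  "",
  "PT J",
  "TI Second title",
  "AB Second abstract line",
  "   also continues.",
  "CT Some conference",
  "ER"
)

# A new field starts at every line that does not begin with a space
group <- cumsum(!grepl("^ ", f))

# Trim each line, then join the lines of each field with single spaces
fields <- unname(vapply(split(trimws(f), group), paste, character(1),
                        collapse = " "))

# Keep the AB fields and drop the leading tag
abstracts <- sub("^AB ", "", fields[grepl("^AB ", fields)])

# One row per abstract (column name chosen for illustration)
abstracts_df <- data.frame(abstract = abstracts, stringsAsFactors = FALSE)
```

With a real file, f would come from readLines("sample.txt") as in the question. Because the grouping relies only on indentation, it does not matter which tag follows the abstract.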
