Research Article
Open access

Sentiment Analysis of Xiaohongshu Texts Based on the RoBERTa Model

Yidi Xie 1*
  • 1 AIEN Institute, Shanghai Ocean University, Shanghai, 201315, China    
  • *corresponding author yidixie0916@outlook.com
Published on 19 December 2024 | https://doi.org/10.54254/2755-2721/2024.18301
ACE Vol.113
ISSN (Print): 2755-273X
ISSN (Online): 2755-2721
ISBN (Print): 978-1-83558-775-1
ISBN (Online): 978-1-83558-776-8

Abstract

This study investigates a novel sentiment analysis model designed for the Xiaohongshu platform, leveraging the RoBERTa model combined with BiLSTM and attention mechanisms. Xiaohongshu, a prominent social commerce platform in China, offers unique challenges for sentiment analysis due to its user-generated content, which is often informal and multi-dimensional. The RoBERTa-BiLSTM-Attention model is introduced to capture complex semantic nuances and enhance the accuracy of sentiment classification. A comparative experiment was conducted to evaluate this model's effectiveness against traditional methods, including Word2vec-LSTM, Word2vec-BiLSTM, BERT-BiLSTM, and BERT-BiLSTM-Attention. Results show that the RoBERTa-BiLSTM-Attention model outperforms these models in terms of accuracy and F1-score, demonstrating its potential to capture and classify sentiment in user-generated content on Xiaohongshu with improved robustness and depth.

Keywords:

Sentiment Analysis, RoBERTa, BiLSTM, Attention Mechanism, Xiaohongshu, Social Commerce, Natural Language Processing


1. Introduction

With the rapid growth of social media and e-commerce platforms, user-generated content has become a key resource for understanding consumer opinions and behavior. Xiaohongshu (also known as Little Red Book) is a popular social commerce platform in China where users share shopping experiences, reviews, and lifestyle tips. The sentiments expressed by Xiaohongshu users toward products and services provide valuable data for sentiment analysis. By extracting and analyzing these sentiments, businesses can gain insights into customer preferences, optimize product development, and enhance marketing strategies.

Sentiment analysis, a vital task in Natural Language Processing (NLP), aims to determine the emotional tone of a text. Traditional approaches, such as machine learning models based on bag-of-words and TF-IDF, have shown effectiveness in analyzing structured text but struggle to capture the complex semantics and context inherent in user-generated content [1]. The advent of deep learning and pretrained language models has provided more powerful tools for in-depth text understanding in sentiment analysis.

Among these models, BERT (Bidirectional Encoder Representations from Transformers) and its variants have become state-of-the-art for many NLP tasks [2]. RoBERTa (A Robustly Optimized BERT Pretraining Approach) builds on BERT by optimizing training strategies to achieve better performance across tasks. RoBERTa, combined with bidirectional Long Short-Term Memory (BiLSTM) networks and attention mechanisms, provides a robust solution for analyzing sentiment in complex user reviews [3]. The proposed model combines RoBERTa’s pretraining, BiLSTM, and attention mechanisms, making it well-suited for Xiaohongshu’s informal and multi-dimensional language.

This paper presents a novel approach to sentiment analysis of Xiaohongshu texts using the RoBERTa model combined with BiLSTM and attention mechanisms. By leveraging RoBERTa’s pretraining on large corpora and adding layers to capture both forward and backward semantics, this study aims to enhance the accuracy and depth of sentiment classification for Xiaohongshu reviews. The study compares the performance of the proposed RoBERTa-based method with other commonly used sentiment analysis models and evaluates its effectiveness in identifying positive and negative sentiments within Xiaohongshu’s user-generated content.

2. Literature Review

2.1. Sentiment Analysis in Natural Language Processing

Sentiment analysis, also known as opinion mining, aims to identify the emotional tone of text and categorize it as positive, negative, or neutral. Early methods used lexicon-based approaches and traditional machine learning algorithms like Naive Bayes and Support Vector Machines (SVM) [4,5], relying on handcrafted features such as bag-of-words (BoW) and term frequency-inverse document frequency (TF-IDF). However, these methods struggled to capture complex semantics and context [6]. The advent of deep learning shifted sentiment analysis towards feature learning directly from data, eliminating the need for manual inputs, especially with recurrent neural networks (RNN) and long short-term memory (LSTM) networks that effectively preserve contextual information in sentences [7,8,9]. Nonetheless, these models faced challenges in handling long-range dependencies and deep semantic information [10].
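To make the classical pipeline concrete, the sketch below wires TF-IDF features into a linear SVM with scikit-learn; the texts and labels are toy placeholders rather than data from this study.

```python
# Minimal sketch of a lexical-feature baseline: TF-IDF + linear SVM.
# The texts/labels below are toy placeholders, not the paper's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["great product, love it", "terrible quality, do not buy"] * 50
labels = [1, 0] * 50  # 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# TF-IDF maps each review to a sparse term-weight vector; the linear SVM
# then learns a separating hyperplane over those fixed, handcrafted features.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```

Because the features are fixed counts rather than learned representations, such a baseline cannot disambiguate context-dependent wording, which is precisely the gap the pretrained models discussed next address.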

2.2. Pretrained Language Models: BERT and RoBERTa

A major breakthrough in NLP came with pretrained language models, particularly BERT (Bidirectional Encoder Representations from Transformers). BERT was designed to understand bidirectional word context by pretraining on large corpora with objectives like Masked Language Model (MLM) and Next Sentence Prediction (NSP), significantly improving tasks like sentiment analysis and text classification [11]. RoBERTa (A Robustly Optimized BERT Pretraining Approach) was introduced as an enhanced version of BERT, removing the NSP objective, increasing batch size, and employing dynamic masking, which improved model generalization. These optimizations allowed RoBERTa to achieve state-of-the-art performance in many NLP benchmarks, making it a preferred model for complex tasks like sentiment analysis [12].
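As an illustration of how such an encoder is consumed downstream, the sketch below loads a RoBERTa-wwm-ext checkpoint with Hugging Face Transformers and extracts token-level and sentence-level embeddings. The checkpoint name is an assumption (the public hfl/chinese-roberta-wwm-ext release matches the model named later in Table 1); any compatible checkpoint would work.

```python
# Sketch: contextual embeddings from a pretrained RoBERTa-style encoder.
# Checkpoint name is assumed (hfl/chinese-roberta-wwm-ext); swap as needed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

batch = tokenizer(["这家店的服务太好了", "质量一般，不推荐"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
with torch.no_grad():
    out = encoder(**batch)

token_embeddings = out.last_hidden_state  # (batch, seq_len, 768): one vector per token
sentence_embeddings = out.pooler_output   # (batch, 768): [CLS]-based sentence vector
```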

2.3. Applications of RoBERTa in Sentiment Analysis

RoBERTa’s ability to handle multilingual and informal language makes it highly relevant for sentiment analysis on platforms like Xiaohongshu, where users often use multiple languages and colloquial expressions [13]. Studies have shown that RoBERTa outperforms traditional models like Word2Vec and RNNs in terms of accuracy and F1-score when analyzing social media and e-commerce reviews, indicating its effectiveness in capturing complex linguistic patterns in user-generated content [13].

2.4. Advancements in Deep Learning Techniques: BiLSTM and Attention Mechanisms

Recent advancements in deep learning, especially the integration of Bidirectional Long Short-Term Memory (BiLSTM) networks and attention mechanisms, have significantly enhanced sentiment analysis models, improving accuracy and efficiency. BiLSTM networks process text in both forward and backward directions, capturing contextual information from preceding and succeeding tokens in a sequence, which is crucial in sentiment analysis where sentence meaning often depends on multiple parts [14,15,16]. Attention mechanisms allow models to focus on the most important words or phrases in a text by assigning weights to input elements, enhancing the model’s sensitivity to emotional cues in user-generated content [18,19,20].
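A minimal PyTorch sketch of such a BiLSTM block with additive attention pooling follows; the layer sizes are illustrative assumptions, not the exact configurations of the cited works.

```python
# Sketch: BiLSTM over token embeddings, pooled by an additive attention head.
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, input_dim=768, hidden_dim=128):
        super().__init__()
        # bidirectional=True reads the sequence both forward and backward
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)  # one relevance score per token

    def forward(self, x, mask=None):              # x: (batch, seq, input_dim)
        h, _ = self.bilstm(x)                      # h: (batch, seq, 2*hidden_dim)
        scores = self.attn(h).squeeze(-1)          # (batch, seq)
        if mask is not None:                       # ignore padded positions
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)    # attention distribution
        return (weights.unsqueeze(-1) * h).sum(1)  # weighted pooling -> (batch, 2*hidden_dim)

pooled = BiLSTMAttention()(torch.randn(2, 16, 768))  # -> shape (2, 256)
```

The attention weights also make the pooling interpretable: tokens carrying strong emotional cues receive larger weights and dominate the final representation.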

2.5. Sentiment Analysis on Xiaohongshu

Sentiment analysis on Xiaohongshu, a social commerce platform, presents unique challenges due to the informal, brief, and multimedia-rich nature of user reviews. These characteristics complicate text-based sentiment extraction, yet Xiaohongshu analysis has great potential for insights into consumer preferences. Early studies mainly used lexicon-based or simple machine learning models, but these struggled with Xiaohongshu's complex context and language [21,22]. The RoBERTa-BiLSTM-Attention model addresses these challenges effectively by capturing context, adapting to language variability, and accurately identifying sentiment in diverse user-generated content.

3. Methodology

3.1. Dataset construction

Text data from Xiaohongshu notes, collected via web crawling from July to September 2024, was preprocessed to address its informal linguistic style. The preprocessing steps included:

1. Removing special characters: Noise such as user tags, hashtags, and emojis was filtered out using Python's re module.

2. Merging columns: Titles and main text were combined into one column to include all emotional cues using the pandas library.

3. Filtering abnormal data: Entries with fewer than three characters or duplicates were removed to improve data quality.

4. Manual labeling: Sentiments were labeled as positive (1) or negative (0).

A total of 10,747 notes were collected, encompassing lifestyle tips, product reviews, and shopping experiences. Of these, 5,290 entries were labeled as positive and 5,457 as negative. The dataset was split into training (80%), validation (10%), and test sets (10%), ensuring a balanced distribution of sentiment categories.
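A sketch of these preprocessing and splitting steps is given below, assuming the crawl is stored as a CSV with title and content columns; the file name, column names, and cleaning patterns are illustrative assumptions.

```python
# Sketch of the preprocessing in Section 3.1 (file/column names hypothetical).
import re
import pandas as pd

def clean(text: str) -> str:
    text = re.sub(r"#\S+#?", "", text)                   # hashtags / topic tags
    text = re.sub(r"@\S+", "", text)                     # user tags
    text = re.sub(r"[\U00010000-\U0010FFFF]", "", text)  # emojis and similar symbols
    return text.strip()

df = pd.read_csv("xiaohongshu_notes.csv")                # hypothetical crawl output
# Merge title and main text so all emotional cues are in one column.
df["text"] = (df["title"].fillna("") + " " + df["content"].fillna("")).map(clean)
# Filter abnormal data: drop very short notes and duplicates.
df = df[df["text"].str.len() >= 3].drop_duplicates("text")

# 80/10/10 train/validation/test split.
train = df.sample(frac=0.8, random_state=42)
rest = df.drop(train.index)
val, test = rest.iloc[: len(rest) // 2], rest.iloc[len(rest) // 2:]
```

In practice the split would be stratified by label to preserve the balanced class distribution described above; the plain random sample is shown only for brevity.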

3.2. Experimental setup

3.2.1. Experimental environment. The model was trained on an RTX 3080 GPU (10 GB) under Ubuntu 20.04, using PyTorch and the Hugging Face Transformers library. Key hardware and software parameters included:

CPU: Intel Xeon Platinum 8255C

Memory: 40GB

Python Version: 3.11

3.2.2. Hyperparameter configuration. Careful hyperparameter selection is essential for training an effective model. After several rounds of experiments, the parameter values for the sentiment analysis model in this study were fixed as shown in Table 1.

Table 1. Hyperparameter configuration

Hyperparameter        Value
Pretrained model      RoBERTa-wwm-ext
Maximum text length   128
Learning rate         1e-5
Dropout               0.1
Batch size            64
Optimizer             Adam
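To make the configuration concrete, the sketch below plugs the Table 1 values into a standard PyTorch optimization step. The classifier here is a stand-in head over pre-computed 768-dimensional encoder features, not the full RoBERTa-BiLSTM-Attention model.

```python
# Sketch: wiring the Table 1 hyperparameters into one training step.
import torch
import torch.nn as nn

MAX_LEN, LR, DROPOUT, BATCH_SIZE = 128, 1e-5, 0.1, 64  # values from Table 1
# (MAX_LEN is applied at tokenization time via max_length=128.)

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                      nn.Dropout(DROPOUT), nn.Linear(256, 2))  # stand-in classifier
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss()

features = torch.randn(BATCH_SIZE, 768)      # stand-in for encoder outputs
labels = torch.randint(0, 2, (BATCH_SIZE,))  # 0 = negative, 1 = positive
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()
```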

3.2.3. Baseline models. To validate the effectiveness of the RoBERTa-BiLSTM-Attention sentiment analysis model, its performance was compared with the following baselines:

Word2Vec-LSTM: Word2Vec with LSTM for sentiment analysis.

Word2Vec-BiLSTM: Word2Vec with BiLSTM for improved contextual understanding.

BERT-BiLSTM: Combines BERT with BiLSTM for semantic feature extraction.

BERT-BiLSTM-Attention: Incorporates an attention mechanism for enhanced sentiment analysis.

3.2.4. Evaluation indicators. Model performance was measured using Accuracy and F1-score. Accuracy evaluates overall prediction correctness, while F1-score balances precision and recall and therefore remains informative on imbalanced datasets.

The formula for calculating Accuracy is shown in Eq. (3.1).

\( Accuracy= \frac{TP+TN}{TP+TN+FP+FN}\ \ \ (3.1) \)

F1 is calculated as shown in Eq. (3.2).

\( F1= \frac{2 \times precision \times recall}{precision+recall}\ \ \ (3.2) \)

where:

\( precision= \frac{TP}{TP+FP}\ \ \ (3.3) \)

\( recall= \frac{TP}{TP+FN}\ \ \ (3.4) \)
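The sketch below checks these definitions on toy predictions, computing the metrics from raw confusion-matrix counts and confirming they match scikit-learn's implementations.

```python
# Sketch: Accuracy and F1 from confusion-matrix counts, per Eqs. (3.1)-(3.4).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # toy labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # toy predictions

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (3.1)
precision, recall = tp / (tp + fp), tp / (tp + fn)  # Eqs. (3.3), (3.4)
f1 = 2 * precision * recall / (precision + recall)  # Eq. (3.2)

assert abs(accuracy - accuracy_score(y_true, y_pred)) < 1e-9
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-9
```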

4. Results

To evaluate the effectiveness of the proposed RoBERTa-BiLSTM-Attention model, comparative experiments were conducted with baseline models, including Word2vec-LSTM, Word2vec-BiLSTM, BERT-BiLSTM, and BERT-BiLSTM-Attention. Accuracy and F1-score were used as primary evaluation metrics, as they measure prediction correctness and the balance between precision and recall. Table 2 summarizes the results, showing that the proposed RoBERTa-BiLSTM-Attention model outperforms all baseline models with the highest accuracy (91.91%) and F1-score (91.39%).

Table 2. Comparison of experimental results for different models

Model                      Accuracy   F1
Word2vec-LSTM              88.01%     86.90%
Word2vec-BiLSTM            88.66%     87.73%
BERT-BiLSTM                90.24%     89.55%
BERT-BiLSTM-Attention      90.80%     90.19%
RoBERTa-BiLSTM-Attention   91.91%     91.39%

The RoBERTa-BiLSTM-Attention model excels due to its advanced word embeddings from RoBERTa, which capture richer semantic information. Adding BiLSTM enhances forward and backward semantic understanding, while the attention mechanism emphasizes key emotional features. Further analysis examined the impact of sentence embedding. By removing the sentence embedding component (RBLA-Seq variant), the model showed reduced accuracy (91.26%) and F1-score (90.67%), highlighting the importance of sentence embedding in improving performance. The comparative results are summarized in Table 3.

Table 3. Ablation of the sentence-embedding component

Model                      Accuracy   F1
RoBERTa-BiLSTM-Attention   91.91%     91.39%
RBLA-Seq                   91.26%     90.67%

The proposed model processes Xiaohongshu's text by generating word and sentence embeddings with RoBERTa. These embeddings are input into a BiLSTM layer to capture contextual semantics, and the attention mechanism enhances focus on key sentiment cues. Concatenating word embeddings with sentence embeddings further improves sentiment classification performance, particularly for Xiaohongshu's informal content.
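A compact PyTorch sketch of this pipeline is shown below. It mirrors the description above (RoBERTa token embeddings into a BiLSTM, attention pooling, concatenation with the sentence embedding, then a classifier), but the layer sizes and pooling details are illustrative assumptions rather than the exact published configuration.

```python
# Sketch of the RoBERTa-BiLSTM-Attention pipeline described in this section.
# Checkpoint name and layer sizes are assumptions; adjust as needed.
import torch
import torch.nn as nn
from transformers import AutoModel

class RoBERTaBiLSTMAttention(nn.Module):
    def __init__(self, checkpoint="hfl/chinese-roberta-wwm-ext",
                 hidden_dim=128, num_classes=2, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.bilstm = nn.LSTM(768, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.dropout = nn.Dropout(dropout)
        # Classifier sees the attention-pooled BiLSTM state concatenated
        # with RoBERTa's sentence embedding (the part RBLA-Seq removes).
        self.fc = nn.Linear(2 * hidden_dim + 768, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h, _ = self.bilstm(out.last_hidden_state)        # contextual states
        scores = self.attn(h).squeeze(-1)
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)          # focus on sentiment cues
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)  # attention pooling
        combined = torch.cat([pooled, out.pooler_output], dim=-1)
        return self.fc(self.dropout(combined))           # class logits
```

In use, input_ids and attention_mask come from a tokenizer call such as the one sketched in Section 2.2, and the final concatenation is exactly the sentence-embedding component whose removal defines the RBLA-Seq ablation in Table 3.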

5. Discussion

The comparative analysis highlights the unique advantages of the RoBERTa-BiLSTM-Attention model for Xiaohongshu sentiment analysis. Integrating RoBERTa with BiLSTM and attention mechanisms enables a more nuanced understanding of user sentiments. While Word2vec models struggled with polysemy, and BERT-based models showed contextual improvements, only the RoBERTa-BiLSTM-Attention model offered both deep semantic capture and weighting of sentiment-critical words, making it highly effective for Xiaohongshu's informal content. The inclusion of sentence embeddings further enhanced semantic coherence and performance. Experiments without sentence embedding confirmed their contribution to model accuracy, demonstrating their importance in handling complex social media texts. These findings validate the potential of advanced language models and attention mechanisms in addressing linguistic challenges of user-generated content and enhancing commercial sentiment analysis applications.

6. Conclusion

This study proposed a RoBERTa-BiLSTM-Attention model that significantly improves sentiment analysis accuracy for Xiaohongshu's informal content. By capturing nuanced semantic relationships and leveraging bidirectional semantic and attention mechanisms, the model outperformed traditional approaches. These findings highlight the importance of advanced techniques in processing user-generated content. Future research could extend this model to multilingual sentiment analysis, further refining its ability to capture and understand user sentiments across diverse social commerce platforms.

Acknowledgments

I would like to extend my gratitude to Professor Osman Yagan from Carnegie Mellon University for his insightful course on Social and Information Networks. The knowledge and techniques gained from this course, especially on topics like information propagation and misinformation analysis, greatly inspired and shaped the framework of this study. His guidance through the course's final project has been invaluable in enriching my understanding of the complex dynamics of social networks and their applications.


References

[1]. Lei Zhang, B. Liu. “Sentiment Analysis and Opinion Mining.” Synthesis Lectures on Human Language Technologies (2012).

[2]. Jacob Devlin, Ming-Wei Chang et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019).

[3]. Muhammad Rizwan Rashid Rana, Asif Nawaz et al. “Sentiment Analysis of Product Reviews Using Transformer Enhanced 1D-CNN and BiLSTM.” Cybernetics and Information Technologies (2024).

[4]. B. Pang, Lillian Lee et al. “Thumbs up? Sentiment Classification using Machine Learning Techniques.” arXiv (2002).

[5]. A. Mitra. “Sentiment Analysis Using Machine Learning Approaches (Lexicon based on movie review dataset).” Journal of Ubiquitous Computing and Communication Technologies (2020).

[6]. Nayeli Hernández, I. Batyrshin et al. “Evaluation of deep learning models for sentiment analysis.” J. Intell. Fuzzy Syst. (2022).

[7]. Yue Zhang, Duy-Tin Vo. “Neural Networks for Sentiment Analysis.” Conference on Empirical Methods in Natural Language Processing (2016).

[8]. G. S. N. Murthy, Shanmukha Rao et al. “Text based Sentiment Analysis using LSTM.” International Journal of Engineering Research & Technology (2020).

[9]. Alpna Patel, A. Tiwari. “Sentiment Analysis by using Recurrent Neural Network.” Materials Science eJournal (2019).

[10]. Faliang Huang, Xuelong Li et al. “Attention-Emotion-Enhanced Convolutional LSTM for Sentiment Analysis.” IEEE Transactions on Neural Networks and Learning Systems (2021).

[11]. Jacob Devlin, Ming-Wei Chang et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” North American Chapter of the Association for Computational Linguistics (2019).

[12]. Yinhan Liu, Myle Ott et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv (2019).

[13]. K. N. Prasanthi, Rallabandi Eswari Madhavi et al. “A Novel Approach for Sentiment Analysis on Social Media using BERT & RoBERTa Transformer-Based Models.” 2023 IEEE 8th International Conference for Convergence in Technology (I2CT) (2023).

[14]. Guixian Xu, Yueting Meng et al. “Sentiment Analysis of Comment Texts Based on BiLSTM.” IEEE Access (2019).

[15]. Linkai Luo, Haiqing Yang et al. “EmotionX-DLC: Self-Attentive BiLSTM for Detecting Sequential Emotions in Dialogues.” arXiv (2018).

[16]. Shervin Minaee, Elham Azimi et al. “Deep-Sentiment: Sentiment Analysis Using Ensemble of CNN and Bi-LSTM Models.” arXiv (2019).

[17]. Md Abrar Jahin, Md Sakib et al. “TRABSA: Interpretable Sentiment Analysis of Tweets using Attention-based BiLSTM and Twitter-RoBERTa.” arXiv (2024).

[18]. Spyridon Kardakis, I. Perikos et al. “Examining Attention Mechanisms in Deep Learning Models for Sentiment Analysis.” Applied Sciences (2021).

[19]. Wei Meng, Yongqing Wei et al. “Aspect Based Sentiment Analysis with Feature Enhanced Attention CNN-BiLSTM.” IEEE Access (2019).

[20]. Huiyu Han, Jin Liu et al. “Attention-Based Memory Network for Text Sentiment Classification.” IEEE Access (2018).

[21]. Maite Taboada, Julian Brooke et al. “Lexicon-Based Methods for Sentiment Analysis.” Computational Linguistics (2011).

[22]. Minghui Huang, Haoran Xie et al. “Lexicon-Based Sentiment Convolutional Neural Networks for Online Review Analysis.” IEEE Transactions on Affective Computing (2020).


Cite this article

Xie, Y. (2024). Sentiment Analysis of Xiaohongshu Texts Based on the RoBERTa Model. Applied and Computational Engineering, 113, 46-51.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2nd International Conference on Machine Learning and Automation

ISBN: 978-1-83558-775-1 (Print) / 978-1-83558-776-8 (Online)
Editor: Mustafa ISTANBULLU
Conference website: https://2024.confmla.org/
Conference date: 21 November 2024
Series: Applied and Computational Engineering
Volume number: Vol.113
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).
