1. Introduction
The authenticity of the collision test between the Li Auto i8 and a CHENGLONG truck has been widely disputed on online platforms. After the test, the A-pillar, B-pillar, C-pillar and door beams of the Li Auto i8 remained undamaged. All nine airbags deployed, the battery pack showed no leakage or fire, and the doors unlocked automatically with the door handles popping out as designed. In contrast, all four wheels of the CHENGLONG truck bounced off the ground at the moment of impact, and its cab separated from the rear cargo box. How Li Auto Inc. responds to public opinion therefore matters. One study found that customers' purchase decisions depend on public opinion and verified the causal influence of public opinion with the Dumitrescu-Hurlin Granger causality test [1]. Knowing the direction of public opinion thus helps Li Auto formulate its response.
Existing research shows that LDA and BERT models can help identify the direction of public opinion. LDA can generate topic words, but it is easily disturbed by semantic ambiguity [2]. BERT captures semantics well; however, it relies on annotated data, and its topic mining ability is insufficient in unsupervised scenarios. Yadav et al. [3] show that combining the two retains the topic distribution characteristics of LDA while refining topic details through BERT's semantic understanding, which makes the topics more accurate.
However, the study by Yadav et al. did not target the automotive field, and few public opinion studies focus on the intersection of the automotive industry and scientific research. In addition, this collision test is recent, so few related studies exist. This research is therefore significant for exploring the applicability of LDA-BERT in the automotive industry and for informing the response to the collision test.
This research aims to identify the themes in public opinion using a fusion of LDA and BERT models. It also explores how well the LDA-BERT fusion model performs in text analysis of automotive public opinion and offers directions for car companies dealing with similar public opinion events.
To address these aims, the following sections describe how comments are obtained, how topics are generated and how sentiment analysis is conducted. The experimental results are then presented and analyzed. The topic words mined by LDA assist BERT's sentiment judgment, which in turn helps identify the entry point for responding to public opinion.
2. Literature review
To better understand the research context and identify the research gaps, this chapter will review the existing literature related to opinion analysis models and automotive public opinion research.
2.1. Direction of model selection
Zhao et al. [2] use a Twitter-LDA model to compare the content topics of Twitter and traditional news media, and find that brands and products are more likely to attract user attention on social media. However, the study is not specific to cars and the model is not universal. Yadav et al. [3] propose a hybrid topic modeling method that combines LDA, BERT and clustering techniques, aiming to overcome the shortcomings of traditional topic models in semantic understanding and topic coherence. However, their data is not specific to the automotive field.
2.2. The common focus of public opinion in the automotive industry
Wu et al. [4] make it clear that technical problems are the main cause of negative public opinion. Wu [5] concludes that negative public opinion is more easily identifiable. However, because of limitations in the models and data domains, these approaches are not suitable for this topic. New models are needed to improve semantic capture and to incorporate new data that reflects real-time changes in public opinion about car companies.
2.3. Model optimization
Venugopalan and Gupta [6] demonstrate that seed words can guide the model to focus on the target topic. However, their field is not the automotive industry, and because the theme words are not associated with emotions, it is unclear which attributes cause negative public opinion.
Ma et al. [7] propose a BERT-based domain-adaptive framework that distinguishes texts from different domains by training a BERT domain classifier. However, it does not analyze associated themes and does not combine LDA with BERT.
2.4. Innovation
Building on this work, this paper combines an optimized LDA-BERT fusion model with recent public opinion on automotive scientific research.
3. Methodology
Based on the research objectives and literature review, this chapter will elaborate on the research methodology adopted in this study, including experimental data, experimental design, and experimental processes.
3.1. Experimental data
3.1.1. Data sources
The data for this research consists of comments from the Weibo platform about the collision test between the Li Auto i8 and the CHENGLONG truck. The comments under the posts were collected with a Python crawler based on the requests library, yielding a total of 599 valid samples.
3.1.2. Data preprocessing
The original text is cleaned and standardized with a fixed process to reduce noise interference:
Text cleaning: remove URL links, numbers, special symbols and meaningless words through regular expressions.
Tokenization: use Jieba to segment Chinese text and add a custom automotive glossary to improve word segmentation accuracy.
Stopword filtering: combine the Harbin Institute of Technology (HIT) stopword list with a custom automotive domain noise word list to eliminate words with no semantic contribution.
Data augmentation: two enhancement strategies address the insufficient sample size and class imbalance. The first is synonym substitution, which randomly replaces words with a 30% probability using a thesaurus of sentiment words from the automotive field. The second is sentiment word insertion, for text
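The cleaning, stopword filtering and synonym substitution steps above can be sketched as follows. This is a minimal illustration rather than the study's actual code: the stopword set and synonym thesaurus are tiny hypothetical stand-ins (the HIT list and the automotive thesaurus are not reproduced here), the tokenizer defaults to whitespace splitting where the study uses Jieba, and applying the 30% probability to each token independently is an assumption.

```python
import random
import re

# Hypothetical stand-ins for the HIT stopword list plus automotive noise words
# and for the automotive sentiment thesaurus used in the study.
STOPWORDS = {"的", "了", "是"}
SYNONYMS = {"假": ["虚假", "不真实"], "安全": ["可靠"]}

URL_RE = re.compile(r"https?://\S+")
NOISE_RE = re.compile(r"[0-9]+|[^\w\s]")  # digits and special symbols


def clean_text(text):
    """Remove URLs, digits and special symbols, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return " ".join(text.split())


def preprocess(text, tokenize=str.split):
    """Clean, tokenize and drop stopwords.

    `tokenize` is whitespace splitting here; for real Chinese text a segmenter
    such as jieba.lcut would be passed in instead.
    """
    return [tok for tok in tokenize(clean_text(text)) if tok not in STOPWORDS]


def synonym_substitute(tokens, rng, prob=0.3):
    """Replace each token that has synonyms with probability `prob` (assumed per token)."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < prob:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, for example `synonym_substitute(preprocess(comment), random.Random(42))`.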
3.2. Experimental design and variable setting
3.2.1. Core model architecture
The experiment adopts a two-dimensional "theme-sentiment" analysis framework. The core model is a BERT sentiment classification model that integrates LDA topic features (LdaBertModel). It has three components. The first is the base module, a pre-trained BERT model (bert-base-chinese). The second is the LDA module, whose input dimension equals the BERT hidden layer dimension plus the number of topics and whose output dimension is 256. The third is the classification layer, a fully connected layer with an output dimension of 3, corresponding to negative, neutral and positive sentiment.
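To make the dimensions concrete, the sketch below traces one forward pass through the two layers that sit on top of BERT, in plain Python with random weights. It is a dimension walk-through under assumptions, not the LdaBertModel itself: the BERT sentence vector is replaced by a dummy 768-dimensional list (768 is the hidden size of BERT-base models), the ReLU between the layers is an assumption, and all function names are hypothetical. A real implementation would more likely use fully connected layers from a deep learning framework on top of the BERT output.

```python
import math
import random

HIDDEN = 768      # hidden size of a BERT-base model
NUM_TOPICS = 4    # number of LDA themes used in the experiment
FUSED = 256       # output dimension of the LDA module
NUM_CLASSES = 3   # negative / neutral / positive


def make_linear(n_in, n_out, rng):
    """Return (weights, bias) of a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply y = Wx + b to a plain list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    """Numerically stable softmax."""
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(bert_vec, topic_dist, fuse, classify):
    """Concatenate BERT and LDA features, fuse to 256 dims, classify into 3."""
    fused_in = bert_vec + topic_dist                         # 768 + 4 = 772 values
    hidden = [max(0.0, h) for h in linear(fused_in, fuse)]   # ReLU (assumed)
    return softmax(linear(hidden, classify))
```

With `fuse = make_linear(HIDDEN + NUM_TOPICS, FUSED, rng)` and `classify = make_linear(FUSED, NUM_CLASSES, rng)`, `forward` returns three class probabilities that sum to 1.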
3.2.2. Variable design
Table 1 presents the detailed design of variables in this experiment, which is crucial for ensuring the validity and reproducibility of the study. The independent variables include the number of LDA themes, classification weights and training hyperparameters, which are determined through multiple experiments to optimize the model performance. The dependent variable is the categorized performance indicators, which are used to evaluate the effectiveness of the LDA-BERT fusion model in sentiment classification. The control variables, such as fixed text length and random seed, are set to eliminate the interference of irrelevant factors on the experimental results.
Table 1. Design of variables

| Variable Type | Specific Variable | Value / Setting |
| --- | --- | --- |
| Independent variable | Number of LDA themes | 4 (determined by experiments) |
| | Classification weights | Negative : Neutral : Positive = 5:1:2 (determined by experiments) |
| | Training hyperparameters | Learning rate = |
| Dependent variable | Categorized performance indicators | (1) Accuracy; (2) Classification report (precision, recall, F1 value) |
| Control variables | Data processing | Fixed text length = 128, random seed = 42 |
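One common way to apply classification weights such as 5:1:2 is a class-weighted cross-entropy loss, in which each sample's loss is scaled by the weight of its true class. The sketch below illustrates that idea in plain Python. Whether the study applies the weights in exactly this way is an assumption, as is normalizing by the sum of the sample weights; the function name is hypothetical.

```python
import math

# Class order follows the label coding 0 = negative, 1 = neutral, 2 = positive.
CLASS_WEIGHTS = [5.0, 1.0, 2.0]


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -log p(true class).

    probs  : list of per-sample probability lists (one value per class)
    labels : list of true class ids
    The sum is normalized by the total weight of the samples (an assumption).
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# A confident mistake on a negative comment costs five times as much as the
# same mistake on a neutral one, which pushes the model toward recalling
# negative comments.
example = weighted_cross_entropy([[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]], [0, 1])
```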
3.2.3. Theme correction mechanism
An LDA theme correction process is designed on the basis of the BERT sentiment results. First, the trained LdaBertModel predicts the sentiment of all texts (0 = negative, 1 = neutral, 2 = positive).
Second, word frequencies are counted for each topic-sentiment group to build a topic-sentiment vocabulary. Third, the dominant sentiment of each theme is determined as the sentiment with the largest number of samples under that theme. Finally, the original LDA topic words are filtered to those that appear frequently in the vocabulary of the dominant sentiment, and the top ten are kept as the corrected theme words.
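The correction procedure can be sketched as below. The sketch makes several assumptions that the text does not specify: each comment is assigned to a single topic, "appear frequently" is modeled as a minimum count (`min_count`, default 1), and the kept words are ranked by their frequency under the dominant sentiment. The function and parameter names are hypothetical.

```python
from collections import Counter, defaultdict


def correct_topics(docs, doc_topics, doc_sentiments, lda_words, top_n=10, min_count=1):
    """Filter LDA topic words by the vocabulary of each topic's dominant sentiment.

    docs           : list of token lists
    doc_topics     : topic id assigned to each document (assumed: one per doc)
    doc_sentiments : predicted sentiment per document (0/1/2)
    lda_words      : dict mapping topic id to its ranked LDA words
    Returns a dict mapping topic id to (dominant_sentiment, corrected_words).
    """
    word_freq = defaultdict(Counter)   # (topic, sentiment) -> word counts
    doc_count = defaultdict(Counter)   # topic -> sentiment -> number of docs
    for tokens, topic, sent in zip(docs, doc_topics, doc_sentiments):
        word_freq[(topic, sent)].update(tokens)
        doc_count[topic][sent] += 1

    corrected = {}
    for topic, words in lda_words.items():
        if not doc_count[topic]:
            continue  # no documents were assigned to this topic
        dominant = doc_count[topic].most_common(1)[0][0]
        freq = word_freq[(topic, dominant)]
        kept = [w for w in words if freq[w] >= min_count]
        kept.sort(key=lambda w: -freq[w])  # stable: ties keep the LDA order
        corrected[topic] = (dominant, kept[:top_n])
    return corrected
```

Words that never occur in comments carrying the dominant sentiment are dropped, which is how a corrected topic can end up with a different word set than the original LDA topic.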
3.3. Experimental processes
3.3.1. Data preparation stage
After the original comment data is read and cleaned, an enriched training dataset is generated through word segmentation, stopword filtering and data augmentation. The data is then divided into a training set and a validation set at a ratio of 8:2, using stratified sampling to preserve the class distribution.
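A stratified 8:2 split can be sketched in plain Python as follows. This is an illustrative helper with a hypothetical name, not the study's code; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument is a common choice. The seed of 42 mirrors the random seed listed among the control variables.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_ratio=0.2, seed=42):
    """Return (train_idx, val_idx) so that every class is split at the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_ratio)  # per-class validation size
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

Because the ratio is applied within each class, a minority class such as positive comments keeps the same share in both subsets.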
3.3.2. LDA topic modeling phase
A dictionary and a corpus of doc2bow vectors are constructed from the preprocessed text and used to train the LDA model.
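The bag-of-words representation that LDA consumes can be illustrated without any library. The sketch below mimics the shape of doc2bow output, a list of (token id, count) pairs per document, using hypothetical helper functions; it does not train an LDA model, and the id assignment order (first appearance) is an assumption that may differ from a real topic modeling library.

```python
from collections import Counter


def build_dictionary(texts):
    """Map each distinct token to an integer id in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


texts = [["collision", "test", "collision"], ["test", "physics"]]
token2id = build_dictionary(texts)
corpus = [doc2bow(tokens, token2id) for tokens in texts]
```

Each document becomes a sparse vector of word counts, so word order is discarded and only co-occurrence within a comment informs the topics.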
3.3.3. Model training and evaluation stage
The BERT tokenizer and pre-trained model are loaded and LdaBertModel is initialized. The model is then trained and evaluated on the validation set, and the accuracy and classification report are recorded.
3.3.4. Subject heading quality evaluation stage
The original LDA themes and the themes corrected by BERT are generated. Quality metrics are calculated for both sets of topic words, and the proportion of words in each topic that are consistent with the topic's dominant sentiment is counted.
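The consistency proportion can be written as a one-line ratio. In the sketch below, `word_sentiment` is a hypothetical lookup from a word to its sentiment label (how the study assigns a sentiment to an individual word is not specified), and words missing from the lookup count as inconsistent, which is also an assumption.

```python
def sentiment_consistency(topic_words, word_sentiment, dominant):
    """Share of topic words whose sentiment label equals the topic's dominant sentiment."""
    if not topic_words:
        return 0.0
    hits = sum(1 for w in topic_words if word_sentiment.get(w) == dominant)
    return hits / len(topic_words)


# Hypothetical labels: 0 = negative, 2 = positive; "speed" has no label.
word_sentiment = {"fool": 0, "cry": 0, "safety": 2}
score = sentiment_consistency(["fool", "cry", "speed", "safety"], word_sentiment, dominant=0)
```

Computing this score for the original and the corrected word lists of the same topic gives a direct before-and-after comparison.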
4. Method justification
4.1. Selected method
This research combines LDA topic modeling, BERT sentiment analysis and data augmentation. The rationale is that automotive public opinion analysis aims to explore the relationship between specific topics and emotional tendencies, which helps car companies locate product problems. The method has three advantages. First, it links themes and sentiment closely, and the data augmentation is tailored to the automotive field. Second, it addresses the scarcity of samples. Third, it reduces the cost of labeling. The disadvantages of alternative methods were also considered. LDA alone can only find topics, and BERT alone cannot summarize which topics cause which emotions. Traditional machine learning methods such as SVM have weak feature expression ability and poor interpretability. Unsupervised sentiment analysis tools such as VADER and TextBlob adapt poorly to Chinese and lack topic association. A purely supervised deep learning model, such as BERT trained on fully annotated data, would incur high annotation costs and generalize insufficiently.
5. Results and discussion
As can be seen from Table 2, Topic 1 is mainly related to authenticity: keywords such as 'live' and 'physics' indicate that the public thinks the test does not obey the laws of physics. Topic 2 shows that the public expects to witness the test themselves. Topic 3 involves 'IQ', which reflects negative evaluations of the collision test. In Topic 4, 'fool' and 'four wheels' may indicate that the public doubts the safety claims and suspects that the test is not real.
Table 2. Original LDA topics (keyword weights in parentheses)

| Topic | Top 10 keywords (weight) |
| --- | --- |
| Topic 1 | Collision (0.0269), Live (0.0167), Test (0.0162), i8 (0.0141), car owner (0.0115), Physics (0.0114), Exist (0.0101), rear-end collision (0.0082), No (0.0078), Directly (0.0078) |
| Topic 2 | Test (0.0258), Look (0.0241), one car (0.0199), national highway (0.0181), Once (0.0179), Collision (0.0135), Newton (0.0114), http (0.0112), Cn (0.0112), Up (0.0109) |
| Topic 3 | Cry (0.0254), IQ (0.0169), head-on collision (0.0160), car owner (0.0136), Li Auto i8 (0.0116), Feel (0.0114), Collision (0.0112), Mass (0.0107), Operation (0.0099), Afraid (0.0097) |
| Topic 4 | No (0.0131), drive (0.0115), four wheels (0.0102), Fool (0.0099), Safety (0.0093), head-on collision (0.0090), know (0.0085), tank (0.0084), Speed (0.0082), Prove (0.0081) |
Table 3 shows that the fusion model classifies the sentiment of comments with high accuracy, reaching 99%. For negative comments, precision, recall and F1-score are all at least 0.99, which demonstrates that the model identifies negative comments accurately. For neutral comments, the three indicators are equally high. Although the scores for positive comments are slightly lower at 0.98, they still meet the requirements of practical application.
Table 3. Classification report of the fusion model

| | precision | recall | F1-score | support |
| --- | --- | --- | --- | --- |
| negative | 0.99 | 1.00 | 0.99 | 234 |
| neutral | 1.00 | 0.99 | 0.99 | 207 |
| positive | 0.98 | 0.98 | 0.98 | 120 |
| accuracy | | | 0.99 | 561 |
| macro avg | 0.99 | 0.99 | 0.99 | 561 |
| weighted avg | 0.99 | 0.99 | 0.99 | 561 |
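As a consistency check, the macro and weighted averages in Table 3 can be recomputed from the per-class rows. The sketch below uses only the rounded values printed in the table, so it can confirm agreement at two decimal places but not reproduce the unrounded scores. It also checks that the 98.93% accuracy reported in the conclusion matches 555 correct predictions out of 561; the count of 555 is inferred from those two figures, not stated in the paper.

```python
# Per-class rows of Table 3: (precision, recall, F1-score, support).
ROWS = {
    "negative": (0.99, 1.00, 0.99, 234),
    "neutral": (1.00, 0.99, 0.99, 207),
    "positive": (0.98, 0.98, 0.98, 120),
}
SUPPORT = 3  # column index of the support count


def macro_avg(rows, col):
    """Unweighted mean of one metric over the classes."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Mean of one metric weighted by each class's support."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


total_support = sum(r[SUPPORT] for r in ROWS.values())
implied_correct = round(0.9893 * total_support)  # inferred, see lead-in

for name, col in (("precision", 0), ("recall", 1), ("F1-score", 2)):
    print(f"{name}: macro={macro_avg(ROWS, col):.2f} weighted={weighted_avg(ROWS, col):.2f}")
```

The macro average treats the three classes equally, while the weighted average is dominated by the larger negative and neutral classes; here both round to 0.99 for every metric.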
Compared with the original LDA topics, the corrected topics in Table 4 are more clearly separated. Topic 1 shows that the public thinks the test does not fit the laws of physics. Topic 2 consists mostly of collision-related words, which suggests that the public doubts the authenticity of the test. The keywords in Topic 3 indicate that the public evaluates the test negatively. In Topic 4, 'speed' and 'drive' show public suspicion about the way the test was conducted.
Table 4. LDA topics after correction by BERT

| Topic | Top 10 keywords |
| --- | --- |
| Topic 1 | head-on collision, live, directly, i8, physics, exist, rear-end collision, no, car owner, heavy truck |
| Topic 2 | test, head-on collision, this, up, collision, http, cn, national highway, once, try |
| Topic 3 | collision, heavy truck, head-on collision, effect, IQ, real, cry, car owner, Li Auto i8, feel |
| Topic 4 | head-on collision, disadvantages, four wheels, speed, i8, drive, know, normal, no, fool |
The word 'collision' appears very frequently across the topics, which shows that the public cares about the safety of the car. Words such as 'live' and 'physics' show that the public is concerned about the authenticity of the test. Many of the theme words are negative. The precision, recall and F1-score for the negative, neutral and positive classes are all high, which indicates that the predictions are accurate and consistent with the experimental expectations.
Some of the themes in this result reflect topical relevance well. The fusion of LDA and BERT identifies negative topic words with high accuracy. However, the filtering of topic words is still insufficient: not all topic words are closely related, and data interference cannot be ignored.
Car companies are advised to be more cautious in safety-related publicity and public relations activities, and to respond promptly to negative public opinion on safety topics. For future research, it would be valuable to explore a smarter method for automatically determining the number of LDA topics.
6. Conclusion
This study explored how to analyze automotive public opinion accurately by integrating topic mining and sentiment classification in order to support automakers' decision-making. It also examined whether LDA-BERT is suitable for the automotive field. To this end, a fusion method combining LDA for topic modeling, BERT for sentiment analysis and data augmentation was employed.
LDA successfully extracted four core topics. The BERT model, trained with data augmentation to alleviate class imbalance, achieved an accuracy of 98.93% in sentiment classification. The fusion approach effectively tackled the challenges of automotive public opinion analysis. From the results, Theme 1 may reflect a demand that enterprises conduct live tests to make them more believable, along with the view that the test violates the laws of physics. Theme 2 may involve a preference for offline rather than online testing, so that the public can watch on the spot and be more convinced. Theme 3 reflects people's subjective feeling that the crash test is anti-intellectual, and Theme 4 may show that people believe irregular driving or operation produced results that contradict their expectations. Li Auto Inc. can therefore organize a transparent test, increase investment in the research and development of crash safety technology, disclose the status and data of vehicle testing, and communicate with customers regularly.
However, limitations exist. The negative sentiment in the LDA topics is not extracted clearly enough, and the thematic relevance of the words is not strong enough. In addition, although the BERT model performs well in understanding Chinese semantics, it still makes errors in identifying some emerging and niche automotive internet slang. Future work could optimize the word filtering, for example by refining the filtering logic so that it generates more relevant topic words.
Despite these limitations, this study makes two contributions. Methodologically, it provides a new technical path for automotive public opinion analysis and verifies the feasibility and effectiveness of integrating LDA and BERT in domain-specific public opinion analysis. Practically, the combined topic and sentiment results can help automakers locate the focal points of public opinion and formulate response strategies, which gives the approach practical application value.
References
[1]. Jeon J, Lim T, D KB, Seok. Effect of Online Word of Mouth on Product Sales: Focusing on Communication-Channel Characteristics. Asia Marketing Journal, Vol. 21, Iss. 2, Article 4, 2019 Jan. https://doi.org/10.15830/amj.2019.21.2.73
[2]. Zhao WX, Jiang J, Weng J, He J, Lim EP, Yan HF, Li XM. Comparing Twitter and Traditional Media Using Topic Models. Springer Nature, Vol. 6611, pp. 338-349, ECIR 2011. https://doi.org/10.1007/978-3-642-20161-5_34
[3]. Yadav AK, Gupta T, Kumar M, Yadav D. A Hybrid Model Integrating LDA, BERT, and Clustering for Enhanced Topic Modeling. Springer Nature, Vol. 59, pp. 2381-2408, 2025 Feb. https://doi.org/10.1007/s11135-025-02077-y
[4]. Wu ZZ, He QF, Li JR, Bi GQ, Antwi-Afari MF. Public attitudes and sentiments towards new energy vehicles in China: A text mining approach. ScienceDirect, Vol. 178, 2023 May. https://doi.org/10.1016/j.rser.2023.113242
[5]. Wu Z. Recognition and Prediction of Positive and Negative Opinions of Car Companies. Hans Publishers, Vol. 11, pp. 121-132, 2021 Jan. https://doi.org/10.12677/CSA.2021.111014
[6]. Venugopalan M, Gupta D. An enhanced guided LDA model augmented with BERT based semantic strength for aspect term extraction in sentiment analysis. ScienceDirect, Vol. 246, 2022 Jun. https://doi.org/10.1016/j.knosys.2022.108668
[7]. Ma XF, Xu P, Wang ZG, Nallapati R, Xiang B. Domain Adaptation with BERT-based Domain Classification and Data Selection. ACL Anthology, Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 76-83, 2019 Nov. https://doi.org/10.18653/v1/D19-6109
Cite this article
Guo, S. (2025). Identification and Analysis of Doubt Directions in Weibo Comment Sections Regarding the Collision Experiment Between Li Auto i8 and CHENGLONG Based on LDA and BERT. Applied and Computational Engineering, 202, 23-30.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
About volume
Volume title: Proceedings of CONF-MLA 2025 Symposium: Intelligent Systems and Automation: AI Models, IoT, and Robotic Algorithms
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.