Topic Modeling and Sentiment Analysis of Customers Using Natural Language Processing and Machine Learning Techniques

Rahimi, Razieh; Ghasemy Yaghin, Reza

doi:10.24200/j65.2025.65318.2418

Articles in Press

Document Type : Article

Authors

Razieh Rahimi ¹
Reza Ghasemy Yaghin ²

¹ Textile Engineering Department, Amirkabir University of Technology

² Department of Industrial Engineering and Management Systems, Amirkabir University of Technology

Abstract

Modeling and predicting customer behavior with data science helps companies understand their customers better. This research analyzes customer reviews of women’s clothing in e-commerce using machine learning and natural language processing (NLP). The machine learning models used are Support Vector Machine, Logistic Regression, Decision Tree, Random Forest, Multinomial Naive Bayes, Complement Naive Bayes, XGBoost, and LightGBM. Text features are extracted from the reviews and vectorized with the TF-IDF and Word2Vec algorithms. Topic modeling is performed with Latent Dirichlet Allocation (LDA), and the reviews are clustered with k-means. The dataset consists of women’s clothing reviews, and the target variable is the customer rating given in each review. The study covers binary, three-class, and five-class scenarios. The target variable, which originally has five classes (scores 1 to 5), is regrouped for the two-class and three-class modes. In the two-class mode, scores below 3 are class zero, while scores of 3 and above are class one. In the three-class mode, scores below 3 are class zero, scores equal to 3 are class one, and scores above 3 are class two. In all three scenarios, the Random Forest model performs best, achieving an accuracy of 0.98 in the binary case, 0.95 in the three-class case, and 0.91 in the five-class case. After the required preprocessing and feature engineering, principal component analysis (PCA) and t-SNE are applied. A scatter plot of the data is then drawn, and the optimal number of clusters is estimated as three using the elbow plot. Next, the data are cleaned by removing punctuation marks, stop words, and words with fewer than three letters, lowercasing the first letter of each word, and lemmatizing. Topic modeling is then performed, and each topic and its associated words are examined. Finally, the topics are examined across the different clusters. These analyses provide a comprehensive understanding of the key themes and concerns customers have when considering womenswear items in each of the four topics.
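
The regrouping of the five-point rating into the two-class and three-class targets described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `to_binary` and `to_three_class` and the sample `ratings` list are hypothetical, and only the thresholds come from the abstract.

```python
def to_binary(score: int) -> int:
    """Two-class mode: scores below 3 -> class 0, scores of 3 and above -> class 1."""
    return 0 if score < 3 else 1


def to_three_class(score: int) -> int:
    """Three-class mode: below 3 -> class 0, equal to 3 -> class 1, above 3 -> class 2."""
    if score < 3:
        return 0
    if score == 3:
        return 1
    return 2


# Hypothetical sample covering every possible rating (1 to 5).
ratings = [1, 2, 3, 4, 5]
binary_labels = [to_binary(r) for r in ratings]
three_class_labels = [to_three_class(r) for r in ratings]
```

In the five-class scenario the rating is used as the target without regrouping.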

Main Subjects

Marketing and E-Commerce

Sharif Journal of Industrial Engineering & Management

Articles in Press, Accepted Manuscript
Available Online from 10 March 2025

