Google Reviews analysis¶
Introduction¶
Objective¶
This analysis provides a comprehensive overview of the customer feedback left on Google Reviews. The goals are to:
- Understand User Sentiment: Examine the sentiments of users who have left reviews on Google, differentiating between positive and negative feedback.
- Identify Key Themes: Use topic modeling to identify recurring themes within the positive and negative reviews.
- Prioritize Areas for Improvement: Determine the key areas of concern and satisfaction among users based on the topics extracted.
Data Source¶
The dataset used for this analysis contains reviews from Google. The key fields include:
- Reviewer Name: The name of the reviewer.
- Local Guide: Indicator of whether the reviewer is a local guide.
- Number of Reviews: Total reviews written by the reviewer.
- Number of Photos: Total photos uploaded by the reviewer.
- Weeks Ago: How many weeks ago the review was written.
- Rating: The rating given by the reviewer.
- Review Text: The actual review provided by the reviewer.
- Number of Likes: The number of likes received for the review.
- Reviewer Nationality: Indicator of whether the reviewer is Italian.
- Review Language: The language in which the review was written.
Analysis Overview¶
Preprocessing:
- The data was first inspected for unusual patterns, such as repeated surnames, which can indicate related or duplicate reviewers.
- The dataset was then split into positive and negative reviews using a rating threshold of 2.5 (ratings above 2.5 are treated as positive, 2.5 and below as negative).
Sentiment Analysis:
- Word clouds were generated to visualize the most frequently used words in both positive and negative reviews, highlighting key aspects of user feedback.
Topic Modeling:
- Latent Dirichlet Allocation (LDA) was used to extract common topics from the positive and negative reviews.
- The average rating was calculated for each topic to understand the general sentiment associated with each.
Key Findings¶
Positive Feedback:
- Users who rated the software highly emphasized themes such as customer service, functionality, and overall satisfaction with the product.
Negative Feedback:
- Users who rated the software poorly highlighted issues such as pricing problems, communication difficulties, and service-related complaints.
Recurring Themes:
- The analysis revealed consistent patterns of feedback, suggesting areas of strength and opportunities for improvement.
Next Steps¶
- Customer Support: Focus on improving customer support based on the feedback from negative reviews.
- Pricing Adjustments: Review and adjust pricing strategies to address concerns raised by dissatisfied users.
- Feature Improvements: Explore opportunities to enhance features and functionality based on the positive feedback to reinforce user satisfaction.
In [1]:
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# NLP utilities
import spacy
# Pandas/matplotlib date handling
from pandas.plotting import register_matplotlib_converters
# Text vectorization and topic modelling
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
In [2]:
import pandas as pd
# Load the raw Google Reviews export
df = pd.read_csv('data/raw/reviews.csv')
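As a first sanity check (before looking at surnames), a quick, hedged inspection along these lines can surface unusual patterns; the column names (name, rating, review) are assumed to match the fields listed in the Data Source section.
In [ ]:
# Sketch of the initial inspection step (assumed column names: 'name', 'rating', 'review')
print(df.shape)                                   # number of reviews and fields
print(df.dtypes)                                  # confirm 'rating' is numeric
print(df['rating'].value_counts().sort_index())   # distribution of ratings
df.head()                                         # eyeball a few rows for odd names or empty review text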
In [3]:
# Extract the surname as the last token of the reviewer name (single-character tokens are discarded)
df['surname'] = df['name'].str.split().str[-1].apply(lambda x: x if len(x) > 1 else None)
# Count occurrences of each surname
surname_counts = df['surname'].value_counts()
# Keep surnames that appear more than once
recurring_surnames = surname_counts[surname_counts > 1].index.tolist()
# Filter the dataframe for entries with recurring surnames
recurring_surname_reviews = df[df['surname'].isin(recurring_surnames)]
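To decide which recurring surnames look suspicious enough to exclude, the flagged rows can be listed directly; a minimal sketch, assuming the surname, name and rating columns available at this point:
In [ ]:
# Sketch: list recurring surnames and the reviews attached to them
print(surname_counts[surname_counts > 1])
recurring_surname_reviews[['name', 'surname', 'rating']].sort_values('surname')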
In [4]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Step 1: Remove records involving the surnames "Rodella" and "Tedeschi"
clean_df = df[~df['surname'].isin(['Rodella', 'Tedeschi'])]
# Step 2: Separate the dataset into two groups
positive_reviews = clean_df[
    (clean_df['rating'] > 2.5)
    & clean_df['review'].notna()
    & (clean_df['review'].str.strip() != '')].copy()  # .copy() so later column assignments do not trigger SettingWithCopyWarning
negative_reviews = clean_df[
    (clean_df['rating'] <= 2.5)
    & clean_df['review'].notna()
    & (clean_df['review'].str.strip() != '')].copy()
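Before generating word clouds and topics, it is worth checking how many usable reviews end up in each group, since topic modelling on a very small negative set will be noisy; a one-line check using the DataFrames just created:
In [ ]:
# Sketch: size of each review group after filtering
print('Positive reviews with text:', len(positive_reviews))
print('Negative reviews with text:', len(negative_reviews))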
In [6]:
# Step 3: Generate word clouds for each group
# Positive reviews word cloud
# Exclude the product name so it does not dominate the word clouds
my_additional_stop_words = [
'Smartpricing', 'smartpricing'
]
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)
positive_text = ' '.join(positive_reviews['review'].tolist())
positive_wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stop_words).generate(positive_text)
# Negative reviews word cloud
negative_text = ' '.join(negative_reviews['review'].tolist())
negative_wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stop_words).generate(negative_text)
# Plotting both word clouds side by side
fig, ax = plt.subplots(1, 2, figsize=(16, 8))
ax[0].imshow(positive_wordcloud, interpolation='bilinear')
ax[0].axis('off')
ax[0].set_title('Positive Reviews (>2.5)', fontsize=16)
ax[1].imshow(negative_wordcloud, interpolation='bilinear')
ax[1].axis('off')
ax[1].set_title('Negative Reviews (<=2.5)', fontsize=16)
plt.tight_layout()
plt.show()
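Word clouds are easy to over-read, so a plain frequency count is a useful cross-check; the sketch below reuses the stop_words set and the positive_text / negative_text strings built above.
In [ ]:
# Sketch: raw token frequencies as a cross-check for the word clouds
from collections import Counter
import re

def top_tokens(raw_text, n=15):
    tokens = re.findall(r"[a-z']+", raw_text.lower())
    return Counter(t for t in tokens if t not in stop_words and len(t) > 2).most_common(n)

print('Positive:', top_tokens(positive_text))
print('Negative:', top_tokens(negative_text))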
In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
def extract_topics(text_data, num_topics=3, num_words=5):
    # Vectorize the text data
    vectorizer = CountVectorizer(stop_words=list(stop_words))
    text_vectorized = vectorizer.fit_transform(text_data)
    # Fit LDA model
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(text_vectorized)
    # Extract the highest-weighted words for each topic
    words = np.array(vectorizer.get_feature_names_out())
    topics = []
    for topic_idx, topic in enumerate(lda.components_):
        top_words_idx = topic.argsort()[-num_words:][::-1]
        topics.append([words[i] for i in top_words_idx])
    return topics
# Extract topics from positive and negative reviews
positive_topics = extract_topics(positive_reviews['review'].tolist(), num_topics=3, num_words=5)
negative_topics = extract_topics(negative_reviews['review'].tolist(), num_topics=3, num_words=5)
positive_topics, negative_topics
Out[7]:
([['optimal', 'excellent', 'service', 'staff', 'helpful'],
  ['years', 'reviews', 'prices', 'benefits', 'time'],
  ['obviously', 'software', 'time', 'excellent', 'reality']],
 [['quick', 'deserved', 'despite', 'heard', 'prices'],
  ['offers', 'weeks', 'sells', 'event', 'adjust'],
  ['receive', 'emails', 'price', 'room', 'sold']])
Positive Reviews (Rating > 2.5)¶
Topic 1:
- Words: 'optimal', 'excellent', 'service', 'staff', 'helpful'
- Interpretation: This topic highlights the high level of satisfaction with customer service. Users praise the optimal and excellent support provided by the staff, who are consistently described as helpful.
Topic 2:
- Words: 'years', 'reviews', 'prices', 'benefits', 'time'
- Interpretation: Long-term users reflect positively on their extended use of the software, noting the beneficial impact on pricing over time and the positive evolution reflected in their reviews.
Topic 3:
- Words: 'obviously', 'software', 'time', 'excellent', 'reality'
- Interpretation: Users affirm the software’s effectiveness and reliability, pointing out that it excellently meets their needs over time, which they describe as an obvious reality of using the product.
Negative Reviews (Rating ≤ 2.5)¶
Topic 1:
- Words: 'quick', 'deserved', 'despite', 'heard', 'prices'
- Interpretation: This topic reflects users' frustration with how quickly and in what manner pricing issues are handled; reviewers feel they deserve better treatment and are often not heard.
Topic 2:
- Words: 'offers', 'weeks', 'sells', 'event', 'adjust'
- Interpretation: Users are dissatisfied with the timing and adjustment of offers and sales events, indicating potential misalignments in promotional strategies.
Topic 3:
- Words: 'receive', 'emails', 'price', 'room', 'sold'
- Interpretation: Operational inefficiencies are a significant concern, especially related to the reception of emails, room pricing strategies, and booking processes.
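The interpretations above rest on the top five words alone; looking at the LDA word weights gives a better sense of how sharp each topic actually is. A minimal sketch, refitting the same pipeline with the stop_words set and review DataFrames defined earlier:
In [ ]:
# Sketch: print the top words of each topic together with their LDA weights
def topics_with_weights(text_data, num_topics=3, num_words=5):
    vectorizer = CountVectorizer(stop_words=list(stop_words))
    X = vectorizer.fit_transform(text_data)
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(X)
    words = np.array(vectorizer.get_feature_names_out())
    for i, component in enumerate(lda.components_):
        top = component.argsort()[-num_words:][::-1]
        print(f'Topic {i + 1}:', ', '.join(f'{words[j]} ({component[j]:.1f})' for j in top))

topics_with_weights(negative_reviews['review'].tolist())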
In [8]:
# Step 1: Identify topics for each review
# Function to identify the dominant topic of each review
def assign_topics(text_data, lda_model, vectorizer):
    text_vectorized = vectorizer.transform(text_data)
    topic_distribution = lda_model.transform(text_vectorized)
    assigned_topics = topic_distribution.argmax(axis=1)
    return assigned_topics

# Vectorize and apply LDA to identify topics for each review
def identify_topics_and_ratings(reviews, num_topics=3):
    vectorizer = CountVectorizer(stop_words=list(stop_words))
    text_vectorized = vectorizer.fit_transform(reviews.loc[:, 'review'])
    lda_model = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda_model.fit(text_vectorized)
    # Assign the dominant topic to each review
    reviews.loc[:, 'topic'] = assign_topics(reviews.loc[:, 'review'], lda_model, vectorizer)
    return reviews

# Assign topics to positive and negative reviews
positive_reviews_with_topics = identify_topics_and_ratings(positive_reviews, num_topics=3)
negative_reviews_with_topics = identify_topics_and_ratings(negative_reviews, num_topics=3)

# Step 2: Calculate the mean rating for each topic
positive_mean_ratings = positive_reviews_with_topics.groupby('topic')['rating'].mean().reset_index()
negative_mean_ratings = negative_reviews_with_topics.groupby('topic')['rating'].mean().reset_index()
positive_mean_ratings, negative_mean_ratings
Out[8]:
(   topic  rating
 0      0     5.0
 1      1     5.0
 2      2     5.0,
    topic  rating
 0      0    1.00
 1      1    1.00
 2      2    1.25)
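To read the mean ratings alongside how many reviews each topic actually covers, a small summary can be printed from the DataFrames produced above; a sketch:
In [ ]:
# Sketch: review counts and mean ratings per topic
for label, reviews_df in [('Positive', positive_reviews_with_topics),
                          ('Negative', negative_reviews_with_topics)]:
    summary = (reviews_df.groupby('topic')['rating']
               .agg(['count', 'mean'])
               .rename(columns={'count': 'n_reviews', 'mean': 'mean_rating'}))
    print(f'{label} reviews:')
    print(summary, '\n')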