We all use social media. Facebook, Instagram, YouTube, LinkedIn & co. have been part of our lives for a long time now. We subscribe, we post, we like, we comment, we share, and we click – without keeping track of the data we generate.
But don’t worry, some people do keep track. Not only do they gather the data we generate, but social media analysts also evaluate it and use it for making business-related decisions. This article will explain the principles and importance of social media analytics and show a simple analysis of social media conversations using Python.
What is Social Media Analytics, and why is it Important?
Social media analytics represents a form of data-driven decision-making (DDDM).
DDDM, in general, means using data to inform the decision-making process; in the case of social media analytics, that data is obtained from social media networks.
Social media analysts collect and analyze data produced through social media engagement to get better insight into users’ sentiments, intentions, and behaviors. Knowing their clients helps companies evaluate the performance of their social media channels and modify marketing campaigns to fit client’s needs and preferences better.
Let’s say an international beauty company is trying to break into a country’s market. They will likely engage local influencers to introduce the company to the public, say a few nice words, show products, or organize a giveaway. By analyzing user engagement, such as the number of likes and shares, and comment sentiment (mostly positive or mostly negative), the company’s management can answer many important questions:
- How successful was the campaign?
- Which platform should be in focus (Facebook, Instagram, or something else?)
- Which influencer should they continue cooperating with?
- Which advertising format is more successful than the others (photos, videos, written posts)?
- Which product(s) are likely to be more popular than others?
- What is the clients’ main concern or objective?
- Which cities/regions should be considered when opening new shops?
As the above illustrates, the ability to gather data from social media and to find meaning in it can be of great help in supporting business decisions and later measuring their consequences. Obtained results are likely to affect product development, branding, marketing strategies, and other business aspects that are not directly related to marketing campaigns.
Listening to the audience, knowing them, and constantly interacting with them can help companies reduce costs, increase profits, and avoid a potential crisis. There are many examples of companies that have reaped enormous benefits from social media and social data analytics.
A less-known aspect of social media analytics is the possibility for a company to track and improve the efficiency of its own marketing team. Interpreting results can give insights into what should be changed and what can be automated and help plan the budget for the upcoming period.
Social Media Metrics
The fact that almost everything done on social media can be tracked through different metrics is both an advantage and a disadvantage. One can easily get lost in a bunch of numbers: clicks, likes, shares, and comments! But some of those might be utterly irrelevant to their use case.
We can roughly group social media metrics into six groups.
Performance metrics help understand which social media marketing efforts are effective and which need improvement. These metrics are mostly straightforward to obtain as well as interpret. They include tracking reach (number of people that saw content), impressions (number of times people saw content), growth rate (increase in the number of followers during a specific time period), engagement (number of reactions, comments, and shares) and click-through rate (shows how often people click links in the content to access additional content, also known as CTR) across different platforms.
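To make these definitions concrete, here is a minimal sketch of how the core performance metrics could be computed. The campaign numbers below are made up for illustration; they are not from this article's dataset:

```python
# Hypothetical campaign numbers -- for illustration only
impressions = 48_000                       # times the content was shown
reach = 31_000                             # unique people who saw it
clicks = 1_250                             # link clicks
reactions, comments, shares = 2_400, 310, 540
followers_start, followers_end = 10_000, 10_850

# Click-through rate (CTR): clicks divided by impressions
ctr = clicks / impressions

# Engagement rate: total interactions relative to reach
engagement_rate = (reactions + comments + shares) / reach

# Growth rate: relative increase in followers over the period
growth_rate = (followers_end - followers_start) / followers_start

print(f"CTR: {ctr:.2%}, engagement: {engagement_rate:.2%}, growth: {growth_rate:.2%}")
```

Note that denominators matter: engagement is often normalized by reach (unique viewers) rather than impressions, and different tools make different choices, so metrics from different platforms are not always directly comparable.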
Audience metrics help understand the audience. Defining the target audience and adequately targeting the content is crucial to social media strategy. The better targeted the content is, the less expensive the marketing campaign is expected to be.
Data gathered across social media platforms is further used to analyze audience demographics, interests, and behaviors.
Competitor metrics help in understanding competitors. Knowing how it stacks up against competitors’ performance is a good starting point for a company to develop its social media strategies.
Paid social metrics help understand if social media strategies are paying off. They provide insights into the return generated for each effort, as well as insights into competitors’ outcomes for marketing campaigns. Important metrics to track are the number of active ads, total ad cost, total clicks, click-through rate, cost per click, cost per engagement, cost per action, and cost per purchase.
To make their marketing efforts successful, companies often collaborate with influencers. Influencer metrics help understand whether partnerships with influencers are meeting expectations. They typically cover the following aspects: interactions per 1,000 followers, audience size, frequently used hashtags, number of posts, and the influencer’s previous partnerships.
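A quick sketch of the first of those metrics, interactions per 1,000 followers, using hypothetical influencer stats (the names and numbers below are invented for illustration):

```python
# Hypothetical per-influencer stats -- for illustration only
influencers = {
    "influencer_a": {"followers": 120_000, "interactions": 8_400},
    "influencer_b": {"followers": 15_000, "interactions": 2_100},
}

def interactions_per_1000(followers: int, interactions: int) -> float:
    """Interactions normalized to an audience of 1,000 followers."""
    return interactions / followers * 1000

for name, stats in influencers.items():
    rate = interactions_per_1000(stats["followers"], stats["interactions"])
    print(f"{name}: {rate:.0f} interactions per 1,000 followers")
```

Normalizing by audience size is the point of this metric: in the example above, the smaller account engages its audience twice as effectively as the larger one, even though its raw interaction count is lower.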
Sentiment analysis is the most complex method, as it uses natural language processing (NLP) to understand whether social conversations about a company or brand are positive or negative. It helps gauge whether a campaign is gaining or losing the audience’s favor. Analyzing customer sentiment can help decide which content will have a more positive impact on an audience, as well as help detect potential issues with customer care services.
The social media metrics a company should track depend highly on its goal(s). If the goal is to grow an Instagram profile, then the number of followers and likes should be in focus. Likewise, if the goal is sales, metrics such as views or click-through rates are more relevant.
Therefore, the first step in choosing metrics is setting a goal as precisely as possible.
Social Media Analytics Tools
Social media analytics uses specifically designed software that works similarly to web search tools. Using keywords or topics, data is retrieved through search queries or web crawlers. Returned text fragments are loaded into a database, categorized, and analyzed to derive meaningful insights.
Typical social media analytics tools use natural language processing (NLP) and machine learning (to identify entities and relationships in unstructured data), segmentation (to categorize social media participants), clustering analysis (to discover hidden relations between keywords or phrases that can further reveal new topics, issues, and opportunities) and dashboards and visualization (charts, graphs or other tools that summarize and visualize findings and make their consumption optimal).
Some of the most popular social media analytics tools are Hootsuite Analytics, Google Analytics, and Rival IQ. The majority of the tools used nowadays are paid. However, with some programming skills and an analytical mindset, you can build your own analytics tool.
The following section walks through a simple example of analyzing comments gathered from different social media platforms using Python.
Analysis of Social Media Conversations Using Python
Data Structure and Cleaning
The first step is naturally gathering all relevant conversations. In this example, we’ll use an existing dataset downloaded from www.kaggle.com. It contains public comments on social media from a three-month period (June 2022 – September 2022) with the keyword “breastfeeding.”
For preprocessing, analysis, and visualization, we’ll use Python and its libraries pandas, re, nltk, wordcloud, matplotlib, and plotly. After importing the libraries and reading the csv file into a data frame called df, we can start getting to know the data.
Using df.sample(10) we can observe a random sample of 10 records.
To get a quick overview of the entire dataset, we can use df.info(). It will return:
- number of rows (238,045)
- number of columns (6)
- column names, types of data within, as well as the count of non-null values per column
At first glance, we see some columns are not relevant for analysis (status_id, display_text_width), and there are missing (null) values in the text and country columns. Although interesting for research in general, the country column has only 485 populated values and, therefore, will not be used. We can also spot the potential for data conversion, as we’d rather have created_at provided as datetime, not as an object.
After dropping irrelevant columns, rows with missing text, and duplicated rows, and performing the necessary data conversions, we have a dataset with 202,279 rows and 3 columns.
Dates and Days Analysis
Now that we have relatively clean data, let’s see if there are any easy-to-draw conclusions.
The first thing we might ask ourselves is: Does the number of comments depend on date, i.e., were there any periods when the analyzed topic was more popular than usual?
If we look at the comments per date distribution, we see a few interesting peaks.
Firstly, breastfeeding was mentioned considerably more often in the first week of August than before. A quick Google search shows that WABA (World Alliance for Breastfeeding Action) marks World Breastfeeding Week from the 1st until the 7th of August. This explains the increased number of social interactions at the beginning of the month.
But what about the peak at the very end of August? Well, on August 30th, 2022, NWA (National Women, Infants, and Children Association) declared its 2022 Breastfeeding Champions, which obviously made the topic explode!
On August 30th alone, 24,823 comments related to breastfeeding were left!
Obviously, August is the “it” month for discussing breastfeeding. On average, 3,911 comments per day were left in August, 2,186 in June, 1,746 in July, and only 1,626 in September. It looks like the public got tired of discussing breastfeeding after the exhausting month of August.
Next, let’s check if the number of comments depends on the days of the week. Is this topic more interesting during weekends than on weekdays?
Well, apparently not! Sunday, Friday, and Saturday are the days with the fewest comments. Although no day particularly stands out, we can see that Tuesday and Wednesday are the days with the most comments. We tend to turn to social media when weekend enthusiasm wears off. We’ll see later whether the weekday affects comment sentiment, too.
Comments Text Analysis
To see the most common words appearing in analyzed comments, it’s helpful to first change the text to full lowercase and remove URLs and tags. To our data frame, preprocessed comments can be added as a separate column, ‘preprocessed_text’.
The next step is the removal of stop words. Stop words are commonly used words (such as articles, pronouns, auxiliary verbs, etc.) that are considered useless data in NLP. We’ll use the NLTK (Natural Language Toolkit)’s list of English stop words. It’s as easy as importing the library and downloading the list of stop words. If needed, we can extend the list with our own words.
Once defined, the stop words list will be used as the ‘stopwords’ parameter in the WordCloud constructor. As we can see in the word cloud below, “breastfeeding” is clearly the most frequent word. This makes perfect sense, as it was used as the keyword during data collection. Other frequent words are “children”, “pregnant”, “breast”, “women”, “milk”, “baby,” and “mother” – all closely related to the subject of breastfeeding.
There is also a significant number of words such as “get”, “one”, “still” or “two”, which are clearly irrelevant. To further improve our simple tool, the list of stop words should be extended with these words.
Sentiment Analysis
In the end, let’s dig into the sentiment of analyzed comments to see if they have positive, negative, or neutral connotations. To simplify the process as much as possible, we’ll use NLTK’s pre-trained sentiment analyzer.
The first step is to create a SentimentIntensityAnalyzer object. On this object, we’ll call the polarity_scores function with a single parameter: the text to analyze (the preprocessed_text column). This function returns a dictionary of four scores: negative, neutral, positive, and compound. The first three scores are non-negative and add up to 1. The compound score, on the other hand, is calculated from them and ranges from -1 to 1.
To classify comments as positive, negative, or neutral, we used the following rule:
- Compound score >=0.05 -> Comment positive
- Compound score<=-0.05 -> Comment negative
- Else -> Comment neutral.
Checking a few examples from all three buckets, it looks like our model did a pretty good job:
- “My baby and I have made it sooooo far in our breastfeeding journey. I never thought I would be able to feed her on demand without any pain. Now it is my favorite time with her” – Positive
- “If you tell me you are now homeless and can’t work because you breastfeed one of your children, then you’re full of sh*t and just don’t want to work. Excuses, excuses, excuses. Millions of breastfeeding moms are in the workplace, ya moron.” – Negative
- “My baby is 3 weeks old, I’m breastfeeding!” – Neutral
As expected, the numbers of positive, negative, and neutral comments do not differ significantly. It’s promising to see, however, that positive comments are at least slightly more frequent than negative ones.
We can further break this down to see if there are days during which the audience is in a particularly good/bad mood.
We seem to be predominantly negative on Tuesdays. According to a study conducted by the London School of Economics, Tuesday is the most depressing day of the week.
Python Code
```python
# imports and downloads
import re
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import nltk
nltk.download('stopwords')
nltk.download('vader_lexicon')
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud

# reading and familiarizing with the data
df = pd.read_csv('Breastfeeding_3months_.csv', engine='python')
df.info()
df.sample(10)

# removing unnecessary columns, rows with no text, and duplicates
df.drop(['status_id', 'display_text_width', 'country'], axis=1, inplace=True)
df.dropna(axis=0, subset=['text'], inplace=True)
df.drop_duplicates(inplace=True)

# converting created_at to datetime
df.created_at = pd.to_datetime(df.created_at)

# creating the comments-per-date distribution
per_date = df.groupby('created_at', as_index=False).agg({'text': 'count'})
per_date.sort_values(by='created_at', inplace=True)
plt.figure(figsize=(12, 7), dpi=150)
plt.title('Comments per Date')
plt.xlabel('Date')
plt.xlim(per_date.created_at.min(), per_date.created_at.max())
plt.ylim(0, 25000)
plt.ylabel('# of Comments')
plt.plot(per_date.created_at, per_date.text, linewidth=2, color='#5D3FD3')

# average daily number of comments per month
per_date.resample('M', on='created_at').mean()

# creating the comments-per-day distribution
per_day = df.groupby('day', as_index=False).agg({'text': 'count'})
fig = px.pie(per_day, values='text', names='day', title='Comments per Day',
             color_discrete_sequence=px.colors.sequential.Purples)
fig.update_traces(textfont_size=14, textinfo='label+percent')
fig.show()

# creating the stop word list
stop_words = stopwords.words('english')
stop_words.extend(['\xa0', '>', 'MailOnline', 'Click&', 'amp', 'RT', '2m', 'Ppl', 'gt',
                   'Much', 'much', 'Yet', 'yet', 'Become', 'become', 'Next', 'next'])

# preprocessing comments: lowercasing and removal of URLs and tags
df['preprocessed_text'] = df['text'].str.lower()
url_pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
tag_pattern = re.compile(r'@(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
df['preprocessed_text'] = df['preprocessed_text'].str.replace(url_pattern, '', regex=True)
df['preprocessed_text'] = df['preprocessed_text'].str.replace(tag_pattern, '', regex=True)

# creating the word cloud
text = " ".join(preText for preText in df.preprocessed_text)
wordcl = WordCloud(stopwords=stop_words, background_color='white', max_font_size=50,
                   max_words=1000, collocations=False).generate(text)
plt.figure(figsize=(20, 20))
plt.imshow(wordcl, interpolation='bilinear')
plt.axis('off')
plt.show()

# sentiment analysis
# sentiment label assignment based on compound score
def getSentimentLabel(compound_score):
    if compound_score >= 0.05:
        return 'Positive'
    elif compound_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

# adding new columns to the dataset
df = df.assign(sentiment_compound=None)
df = df.assign(sentiment_label=None)

# using SentimentIntensityAnalyzer combined with getSentimentLabel
sid = SentimentIntensityAnalyzer()
for index in df.index:
    sentence = df.at[index, 'preprocessed_text']
    compound_score = sid.polarity_scores(sentence)['compound']
    df.at[index, 'sentiment_compound'] = compound_score
    df.at[index, 'sentiment_label'] = getSentimentLabel(compound_score)

# comments-per-sentiment distribution
sentiment = df.groupby('sentiment_label', as_index=False).agg({'text': 'count'})
fig = px.pie(sentiment, values='text', names='sentiment_label', title='Comments Sentiment',
             color='sentiment_label',
             color_discrete_map={'Positive': 'seagreen', 'Negative': 'crimson', 'Neutral': 'yellow'})
fig.update_traces(textfont_size=14, textinfo='label+percent')
fig.show()

# comments' sentiment distribution per weekday
sentiment_day = df.groupby(['day', 'sentiment_label'], as_index=False).agg({'text': 'count'})
px.bar(sentiment_day, x='day', y='text', color='sentiment_label',
       color_discrete_map={'Positive': 'seagreen', 'Negative': 'crimson', 'Neutral': 'yellow'},
       labels={'text': '# of Comments', 'day': 'Day of the Week', 'sentiment_label': 'Sentiment'})
```