How do influencers on Reddit form a network of related subreddits?
A network graph analysis on subreddit influencers.
INTRODUCTION
Which platform comes to mind when you hear about social media? Facebook, Twitter, Instagram? There are other social media platforms too, and you have probably heard of Reddit. Reddit is a social news aggregation and discussion website where users react to and comment on posts. It is made up of thousands of smaller forums called subreddits, which are content-centric rather than user-centric: users subscribe to subreddits centered on topics or communities that interest them. These subreddits are not organized in any systematic way, though, and Reddit users usually discover new ones through word of mouth, that is, from one subreddit they find another. In this way one subreddit influences another, directly or indirectly.
Networks are everywhere. A network, or graph, is a set of objects (called nodes) connected by relationships (called edges). We can find the most influential users by identifying important nodes using network analysis. We will use the NetworkX library to create graphs and see how users influence other subreddits, and which Redditors different subreddits have in common.
DATA COLLECTION
Importing required Libraries
import numpy as np
import pandas as pd
import praw #for reddit wrapper
import matplotlib.pyplot as plt #for basic visualizations
import networkx as nx #to create Network Graphs
PRAW is a Python wrapper for the Reddit API, which enables you to scrape data from subreddits. Reddit’s API gives you about one request per second, which seems pretty reasonable for small-scale projects.[2] The pandas library is used for data manipulation and analysis, and the matplotlib library is used to create visualizations in Python. The NetworkX library will be used for network analysis and for creating network graphs.
First, make sure you have PRAW installed on your system (refer to this link). We also need to create a Reddit application to get the API keys; follow this tutorial and complete its steps before moving forward.
# Setting up the Reddit API in Python
reddit = praw.Reddit(client_id='Your Client ID',
                     client_secret='Your Client Secret',
                     user_agent='User')
To extract posts, I have created a function, get_posts, that returns post data for a subreddit.
def get_posts(subred_name, n):
    """Return the top n posts of a subreddit as a DataFrame."""
    subreddit = reddit.subreddit(subred_name)
    posts_info = []
    for subm in subreddit.top(limit=n):
        # One row per submission; str() turns PRAW objects into plain names
        posts_info.append([subm.id,
                           str(subm.author),  # 'None' for deleted users
                           subm.score,
                           subm.upvote_ratio,
                           subm.num_comments,
                           str(subm.subreddit)])
    # Sort rows by author name (descending) so each author's posts sit together
    sorted_info = sorted(posts_info, key=lambda x: x[1], reverse=True)
    posts_df = pd.DataFrame(sorted_info, columns=['id', 'author', 'score', 'upvote_ratio', 'num_comments', 'subreddit'])
    return posts_df
For our analysis, we will use r/programming, a computer programming subreddit, and create network graphs for it. Its top 500 posts, fetched with get_posts, are stored in the DataFrame prog_df used below.
Data Cleaning
1. Taking only those users who have posted more than once
freq_authors = prog_df[prog_df.duplicated(['author'], keep = False)]
2. Getting rid of deleted users
freq_authors = freq_authors[freq_authors.author != 'None']
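The two cleaning steps above can be tried on a small made-up DataFrame. This is only a sketch: toy_df and its author names are invented stand-ins for prog_df, not real Reddit data.

```python
import pandas as pd

# Toy stand-in for prog_df: one row per post (made-up authors)
toy_df = pd.DataFrame({
    "id": ["p1", "p2", "p3", "p4", "p5", "p6"],
    "author": ["alice", "bob", "alice", "None", "carol", "None"],
})

# keep=False flags every occurrence of a repeated author, not just the later ones
freq = toy_df[toy_df.duplicated(["author"], keep=False)]
# Deleted accounts show up as the string 'None' and are dropped
freq = freq[freq.author != "None"]

print(freq)
```

Only alice's two posts survive: bob and carol posted once, and the two 'None' rows are repeated but belong to deleted accounts.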
Among the 500 posts, there are 36 users who have posted more than once.
plt.figure(figsize=(10, 5))
ax = freq_authors['author'].value_counts().plot(kind='bar',title='Distribution of author/users and their posts')
ax.set_xlabel("Users")
ax.set_ylabel("Number of posts")
The bar graph shows the distribution of the 36 users who had more than one post among the top 500 on r/programming.
Based on this bar graph, we take 2 posts on the subreddit as the cut-off for a user to be considered an influencer.
Now, for further analysis, we create a list of the users/authors considered influencers above and see in which other subreddits they appear.
authors_lst = list(freq_authors.author.unique())
Next, for every ‘influencer’ user we get their 10 top posts and the names of the other subreddits they post in.
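# Collect the top 10 posts of every influencer into one DataFrame.
# Note: get_posts as written looks up reddit.subreddit(name); for a user name,
# PRAW exposes the equivalent listing as reddit.redditor(name).submissions.top(limit=n).
authors_df = pd.DataFrame()
for u in authors_lst:
    c = get_posts(u, 10)
    authors_df = pd.concat([authors_df, c])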
We will use this data to create the network graph for our analysis.
The graph shows the number of posts made by the influencers on these related subreddits.
NETWORK ANALYSIS USING NetworkX
Network Graphs are a way of structuring, analyzing and visualizing data that represents complex networks, for example social relationships or information flows.[1]
NetworkX is a “high-productivity software for complex networks” analysis library. It includes many graph generator functions and facilities to read and write graphs in many formats.
Let’s start by creating a graph
g = nx.from_pandas_edgelist(nx_df, source='author', target='subreddit')
Graph Information
Basic Information
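There are 119 nodes and 150 edges. Because the graph is built from the ‘author’ and ‘subreddit’ columns, the nodes are both the authors/users and the subreddits, and each edge connects an author to a subreddit they posted in. Two subreddits are therefore linked through the authors who post in both.[4]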
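To see how an edge list turns into node and edge counts, here is a minimal sketch on a made-up table. The names toy_df and toy_g and the authors in it are assumptions for illustration only; the real graph is built from the scraped data.

```python
import networkx as nx
import pandas as pd

# Toy edge list (made-up names): one row per post
toy_df = pd.DataFrame({
    "author":    ["alice", "alice", "alice", "bob", "carol"],
    "subreddit": ["programming", "programming", "technology", "programming", "programming"],
})

toy_g = nx.from_pandas_edgelist(toy_df, source="author", target="subreddit")

# Authors and subreddits both become nodes; repeated rows collapse into one edge
print(toy_g.number_of_nodes())  # 3 authors + 2 subreddits
print(toy_g.number_of_edges())  # 4 distinct author-subreddit pairs
```

Note that alice's two posts in programming produce a single edge, so edge counts reflect distinct author-subreddit pairs rather than the number of posts.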
Density of the Graph
The density of a graph is simply the ratio of actual edges in the network to all possible edges in the network. Here, density reflects how many of the possible connections between nodes actually exist.
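As a rough check, the density can be worked out by hand from the node and edge counts reported above, assuming an undirected graph without self-loops. The random graph in the second half is only a stand-in with the same counts, not the real data.

```python
import networkx as nx

n_nodes, n_edges = 119, 150  # counts reported above

# Undirected simple graph: at most n * (n - 1) / 2 edges are possible
possible_edges = n_nodes * (n_nodes - 1) / 2
density = n_edges / possible_edges
print(round(density, 4))

# Cross-check on a random stand-in graph with the same counts (not the real data)
stand_in = nx.gnm_random_graph(n_nodes, n_edges, seed=0)
print(round(nx.density(stand_in), 4))
```

With 7021 possible edges, only about 2% of them are present, so the network is sparse.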
Structure of the Graph
Degree Centrality
The degree centrality for a node v is the fraction of nodes it is connected to.[4] It is used to determine which nodes are most connected. Here the maximum degree centrality belongs to programming (0.27), which means that more nodes are connected to the programming subreddit than to any other node; technology comes next (0.07).
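A minimal sketch of degree centrality on a made-up graph (the authors a1 to a3 are hypothetical): three authors post in programming, and one of them also posts in technology.

```python
import networkx as nx

# Toy graph (made-up authors): three authors post in programming, one also in technology
toy_g = nx.Graph()
toy_g.add_edges_from([
    ("a1", "programming"), ("a2", "programming"), ("a3", "programming"),
    ("a1", "technology"),
])

dc = nx.degree_centrality(toy_g)  # degree / (number of nodes - 1)
for node, value in sorted(dc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {value:.2f}")
```

In this toy graph programming is connected to 3 of the 4 other nodes, so its degree centrality is 0.75.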
Closeness Centrality
Closeness centrality is based on the mean distance from one node to every other node: the more central a node is, the closer it is to all the other nodes. Again, the programming node has the highest closeness centrality, which means that it is closer to the other nodes than any other node is.[6]
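The idea can be sketched on a small made-up graph (hypothetical authors a1 to a3). NetworkX computes closeness as the reciprocal of the mean shortest-path distance, so a smaller average distance gives a higher score.

```python
import networkx as nx

# Toy graph (made-up authors): programming is the hub, technology hangs off a1
toy_g = nx.Graph()
toy_g.add_edges_from([
    ("a1", "programming"), ("a2", "programming"), ("a3", "programming"),
    ("a1", "technology"),
])

cc = nx.closeness_centrality(toy_g)  # (n - 1) / sum of shortest-path distances
closest = max(cc, key=cc.get)
print(closest, round(cc[closest], 2))
```

Here programming reaches the other four nodes in a total of 5 steps, giving 4 / 5 = 0.8, the highest value in the toy graph.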
Betweenness Centrality
Betweenness centrality measures the number of shortest paths that a node lies on. This centrality is usually used to determine the flow of information through the graph.
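A small sketch, again on a made-up graph with hypothetical authors, shows which nodes sit on the shortest paths between others. By default NetworkX normalizes the count by the number of node pairs.

```python
import networkx as nx

# Toy graph (made-up authors): every path between a2/a3 and the rest goes through programming
toy_g = nx.Graph()
toy_g.add_edges_from([
    ("a1", "programming"), ("a2", "programming"), ("a3", "programming"),
    ("a1", "technology"),
])

bc = nx.betweenness_centrality(toy_g)  # normalized by (n - 1)(n - 2) / 2 pairs
for node, value in sorted(bc.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {value:.2f}")
```

Leaf nodes such as a2, a3 and technology score 0 because no shortest path passes through them, while a1 acts as the only bridge to technology.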
Eigenvector Centrality
Eigenvector centrality measures a node’s relative influence in the network, or how well a node is connected to other highly connected nodes. From this, we can say that our top 3 influencers are programming, technology and kurtstir.
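As an illustration on a made-up graph (hypothetical authors a1 to a3), eigenvector centrality rewards nodes whose neighbours are themselves well connected. The max_iter value is just a generous assumption to give the iterative solver room to converge.

```python
import networkx as nx

# Toy graph (made-up authors): a1 is linked to the hub and to technology
toy_g = nx.Graph()
toy_g.add_edges_from([
    ("a1", "programming"), ("a2", "programming"), ("a3", "programming"),
    ("a1", "technology"),
])

ec = nx.eigenvector_centrality(toy_g, max_iter=1000)
ranking = sorted(ec, key=ec.get, reverse=True)
print(ranking)
```

In this toy graph a1 ranks above a2 and a3 even though all three are attached to the hub, because a1 has an extra connection.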
GRAPH VISUALIZATIONS
from matplotlib.pyplot import figure
figure(figsize=(10, 10))
nx.draw_shell(g, with_labels=True)
leaderboard = {}
for x in g.nodes:
    leaderboard[x] = len(g[x])  # number of neighbours of node x
s = pd.Series(leaderboard, name='connections')
df_conn = s.to_frame().sort_values('connections', ascending=False)
This shows the top nodes by number of connections. As expected, programming comes first, since it is our main focus.
Network Graph of related subreddits of programming
# Create the graph from the dataframe [5]
g = nx.from_pandas_edgelist(nx_df, source='author', target='subreddit')
# Subreddit nodes (assumed here to be the unique values of the 'subreddit' column)
subs = list(nx_df['subreddit'].unique())
# Create a layout for nodes
# https://towardsdatascience.com/an-intro-to-graph-theory-centrality-measurements-and-networkx-1c2e580adf37
layout = nx.spring_layout(g, iterations=50, scale=2)
# Draw the subreddits, sized by degree (multiplied by 80 to get a visible circle size)
sub_size = [g.degree(sub) * 80 for sub in subs]
nx.draw_networkx_nodes(g,
                       layout,
                       nodelist=subs,
                       node_size=sub_size,
                       node_color='powderblue')
# Draw all the influencers (authors)
nx.draw_networkx_nodes(g, layout, nodelist=authors_lst, node_color='green', node_size=100)
# Draw highly connected influencers
popular_people = [person for person in authors_lst if g.degree(person) > 1]
nx.draw_networkx_nodes(g, layout, nodelist=popular_people, node_color='orange', node_size=100)
nx.draw_networkx_edges(g, layout, width=1, edge_color="lightgreen")
node_labels = dict(zip(subs, subs))  # labels for subs
nx.draw_networkx_labels(g, layout, labels=node_labels)
plt.axis('off')
plt.title("Network Graph of Related Subreddits")
plt.show()
Influencer nodes appear in green, while the subreddits appear in light blue and are sized according to their number of connections. The influencers who have connections beyond r/programming are highlighted in orange.[5]
BUGS ENCOUNTERED
I encountered an error while looping through every “influencer” user to get the top posts for each: a 403 HTTP response error. There can be multiple reasons behind this error. Check out this link if you get the same error.
I got this error because one of the users I fetched was no longer on Reddit, possibly because the user was banned or suspended. So I had to manually check each user and remove the one causing the error (‘SuperheroNick’) from my data.
# Removing the author (user) for which posts are not available
authors_lst.remove('SuperheroNick')
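Instead of checking every user by hand, the loop could skip users whose fetch fails. The sketch below is one possible approach, not what was done in this analysis: collect_posts and fake_fetch are hypothetical helpers, and the stand-in fetcher raises PermissionError so the example runs without Reddit credentials. With PRAW, the exception to pass would presumably be prawcore.exceptions.Forbidden (an assumption to verify against your installed version).

```python
def collect_posts(authors, fetch, skip_errors=(Exception,)):
    """Call fetch(author) for each author; return (results, skipped_authors)."""
    results, skipped = {}, []
    for author in authors:
        try:
            results[author] = fetch(author)
        except skip_errors:
            # e.g. a banned or suspended account answering with HTTP 403
            skipped.append(author)
    return results, skipped


# Stand-in fetcher so the sketch runs without Reddit credentials
def fake_fetch(author):
    if author == "suspended_user":
        raise PermissionError("403")
    return [f"{author}_post"]


results, skipped = collect_posts(["alice", "suspended_user", "bob"], fake_fetch,
                                 skip_errors=(PermissionError,))
print(skipped)
```

Returning the skipped names alongside the results keeps a record of which users were dropped, which is useful when reporting how many influencers the final graph actually covers.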
CONCLUSION AND LIMITATIONS
From the above network graphs and analysis, we can say that ‘technology’, ‘kurtstir’ and ‘serialk’ are among the top influencers around r/programming. Many other subreddits are connected to each other by shared users. Subreddits that are closer together on the plot are more likely to have users who post in both. The bar graph in our analysis also showed how frequent posters on r/programming post in other subreddits. Redditors who post in r/programming also post in Linux, ProgrammerHumor, Learn programming, Webdev, technology, etc.
Our sample of the top five hundred posts in a subreddit is limited and may give skewed results: the data is not consistent, so some influencers may appear more often than they otherwise would.
REFERENCES:
[2] https://medium.com/analytics-vidhya/praw-a-python-package-to-scrape-reddit-post-data-b759a339ed9a
[3] https://www.cl.cam.ac.uk/teaching/1415/L109/l109-tutorial_2015.pdf