How do influencers on Reddit form a network of related subreddits?

Jasneek Chugh

Published in Web Mining [IS688, Spring 2021]

Feb 24, 2021

A network graph analysis on subreddit influencers.

The image shows how different channels are related to each other

INTRODUCTION

What platform comes to mind when you hear the words social media? Facebook, Twitter, Instagram? Did you know there are social media platforms other than those? You must have heard about Reddit. Reddit is more of a social news aggregation and discussion website where users react to and comment on posts. It is made up of thousands of smaller forums called subreddits, which are content-centric rather than user-centric. Users subscribe to subreddits centered on topics or communities that interest them. Subreddits are not organized in any systematic way, though, and Reddit users usually discover new ones through word of mouth, that is, they find a new subreddit through one they already follow. In this way, one subreddit influences another, directly or indirectly.

Networks are everywhere. A network, or graph, is a set of objects (called nodes) with relationships between them (called edges). We can find the most influential users by identifying important nodes using network analysis. We will use the NetworkX library to create graphs and see how users influence other subreddits, and which Redditors different subreddits have in common.

DATA COLLECTION

Importing required Libraries

import numpy as np 
import pandas as pd
import praw #for reddit wrapper
import matplotlib.pyplot as plt #for basic visualizations
import networkx as nx #to create Network Graphs

PRAW is a Python wrapper for the Reddit API, which enables you to scrape data from subreddits. Reddit’s API allows about one request per second, which seems pretty reasonable for small-scale projects.[2] The pandas library is used for data manipulation and analysis, and the matplotlib library is used to create visualizations in Python. The NetworkX library will be used for network analysis and for creating network graphs.

First, make sure you have PRAW installed on your system (refer to this link). We also need to create a Reddit application to get the API keys; follow this tutorial and complete its steps before moving forward.

#Setting up the Reddit API in python
reddit = praw.Reddit(client_id='Your Client ID',
                     client_secret='Your Client Secret',
                     user_agent='User')

To extract the posts, I have created a function, get_posts, which collects post data for a given subreddit.

def get_posts(subred_name, n):
    """Return the top n posts of a subreddit as a DataFrame."""
    subreddit = reddit.subreddit(subred_name)
    posts_info = []

    for subm in subreddit.top(limit=n):
        posts_info.append([
            subm.id,
            str(subm.author),     # 'None' when the account was deleted
            subm.score,
            subm.upvote_ratio,
            subm.num_comments,
            str(subm.subreddit),  # store the name rather than the PRAW object
        ])

    # Sort by author name (descending) so posts by the same user sit together
    sorted_info = sorted(posts_info, key=lambda x: x[1], reverse=True)
    posts_df = pd.DataFrame(sorted_info, columns=['id', 'author', 'score', 'upvote_ratio', 'num_comments', 'subreddit'])
    return posts_df

For our analysis, we will use r/programming, a computer programming subreddit, and create network graphs from its top 500 posts, held in the DataFrame prog_df.

Data Cleaning

1. Taking only those users who have posted more than once

freq_authors = prog_df[prog_df.duplicated(['author'], keep = False)]

2. Getting rid of deleted users

freq_authors = freq_authors[freq_authors.author != 'None']

Among the top 500 posts, there are 36 users who have posted more than once.
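To see what the two cleaning steps do, here is a minimal sketch on a made-up DataFrame (the ids and author names are invented for illustration). `duplicated(..., keep=False)` marks every row whose author appears more than once, and the string 'None' is what `str(subm.author)` produces when the account has been deleted.

```python
import pandas as pd

# Toy stand-in for prog_df; the ids and author names are invented.
toy_df = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4", "a5"],
    "author": ["alice", "bob", "alice", "None", "None"],
})

# keep=False flags every row of a repeated author, not only the later ones.
freq = toy_df[toy_df.duplicated(["author"], keep=False)]

# Deleted accounts show up as the string 'None', so drop them.
freq = freq[freq.author != "None"]

print(list(freq["id"]))  # ['a1', 'a3']
```

Only alice survives: bob posted once, and the two 'None' rows are removed even though they count as duplicates of each other.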

plt.figure(figsize=(10, 5))
ax = freq_authors['author'].value_counts().plot(kind='bar',title='Distribution of author/users and their posts')
ax.set_xlabel("Users")
ax.set_ylabel("Number of posts")

Distribution of users and their posts on r/programming

The bar graph shows the 36 users who each had more than one post among the top 500 on r/programming, and how many posts each of them made.

From this bar graph, we set the cut-off at 2 posts on the subreddit: any user with at least 2 posts is considered an influencer.

Now, for further analysis, we will create a list of the users/authors considered influencers above and see on which other subreddits they appear.

authors_lst = list(freq_authors.author.unique())

Now we will fetch the data for every ‘influencer’ user: the top 10 posts of each, along with the names of the other subreddits they post in.

authors_df =  pd.DataFrame() 
authors_df = authors_df.fillna(0)
for u in authors_lst: 
    c = get_posts(u, 10)
    authors_df = pd.concat([authors_df, c])
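The loop above passes a username to get_posts, which looks names up through `reddit.subreddit`. If the goal is a user's own submissions, one possible approach is PRAW's redditor listing, sketched below. The helper name `get_user_posts` and its column choice are assumptions made for illustration, not the author's code, and `reddit` is assumed to be an authenticated `praw.Reddit` instance.

```python
import pandas as pd

def get_user_posts(reddit, username, n):
    """Return the top-n submissions of one user as a DataFrame.

    `reddit` is assumed to be an authenticated praw.Reddit instance;
    this helper is a hypothetical sketch, not part of PRAW itself.
    """
    rows = []
    # redditor(...).submissions.top(...) lists a user's best-scoring posts.
    for subm in reddit.redditor(username).submissions.top(limit=n):
        rows.append([subm.id, username, subm.score, str(subm.subreddit)])
    return pd.DataFrame(rows, columns=["id", "author", "score", "subreddit"])
```

With such a helper, the per-user frames could be gathered in one step, for example `pd.concat([get_user_posts(reddit, u, 10) for u in authors_lst])`.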

We will use this data to create the network graph for our analysis.

Figure shows on which subreddit the influencers post the most

The graph shows the number of posts made by the influencers on these related subreddits.

NETWORK ANALYSIS USING NetworkX

Network graphs are a way of structuring, analyzing and visualizing data that represents complex networks, for example social relationships or information flows.[1]

NetworkX describes itself as “high-productivity software for complex networks”. It includes many graph generator functions and facilities to read and write graphs in many formats.

Let’s start by creating a graph

g = nx.from_pandas_edgelist(nx_df, source='author', target='subreddit')
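The article does not show how nx_df is built, so the sketch below assumes it holds one row per (author, subreddit) pair and uses a made-up edge list to show what `from_pandas_edgelist` produces. Both the authors and the subreddits become nodes, and repeated pairs collapse into a single edge because the result is a simple undirected graph.

```python
import networkx as nx
import pandas as pd

# Hypothetical edge list: one row per (author, subreddit) pair.
toy_edges = pd.DataFrame({
    "author": ["alice", "bob", "bob", "bob"],
    "subreddit": ["programming", "programming", "technology", "technology"],
})

toy_g = nx.from_pandas_edgelist(toy_edges, source="author", target="subreddit")

print(toy_g.number_of_nodes(), toy_g.number_of_edges())  # 4 3
print(sorted(toy_g.neighbors("bob")))  # ['programming', 'technology']
```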

Graph Information

Basic Information

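There are 119 nodes and 150 edges. Because the graph is built from author–subreddit pairs, the nodes are both the ‘subreddits’ and the ‘authors/users’, and each edge connects an author to a subreddit they posted in. Two subreddits are therefore related when an author who posts on two or more different subreddits links them.[4]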

Density of the Graph

The density of a graph is simply the ratio of actual edges in the network to all possible edges in the network. Here, Density refers to the “connections” between subreddits.
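For an undirected simple graph with n nodes and m edges, this ratio works out to 2m / (n(n-1)). The sketch below checks that formula against `nx.density` on a small path graph and then plugs in the node and edge counts reported above; the helper name `density_from_counts` is introduced here for illustration only.

```python
import networkx as nx

def density_from_counts(n_nodes, n_edges):
    """Density of an undirected simple graph: 2m / (n(n-1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# Check the formula against NetworkX on a small path graph (4 nodes, 3 edges).
path = nx.path_graph(4)
assert abs(nx.density(path) - density_from_counts(4, 3)) < 1e-12

# Plugging in the counts reported for the subreddit graph.
print(round(density_from_counts(119, 150), 4))  # 0.0214
```

A value this close to zero would mean the graph is sparse: only about 2% of all possible connections are present.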

Structure of the Graph

Degree Centrality

Top 10 Subreddits with maximum Degree Centrality

The degree centrality for a node v is the fraction of nodes it is connected to.[4] It is used to determine which nodes are most connected. Here we can see that programming has the maximum degree centrality (0.27), which means it is directly connected to about 27% of the other nodes, more than any other node. technology comes next (0.07).
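As a small illustration of how such a ranking can be produced, the sketch below computes degree centrality on a made-up graph (the user names are invented) and keeps the three most connected nodes. NetworkX divides each node's degree by n - 1, the number of other nodes.

```python
import networkx as nx

# Toy author-subreddit graph; the user names are invented.
toy_g = nx.Graph([
    ("alice", "programming"),
    ("bob", "programming"),
    ("carol", "programming"),
    ("bob", "technology"),
])

deg = nx.degree_centrality(toy_g)  # degree / (n - 1)

# Rank nodes from most to least connected and keep the top 3.
top3 = sorted(deg.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top3[0])  # ('programming', 0.75)
```

Here programming touches 3 of the 4 other nodes, hence 0.75.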

Closeness Centrality

Top 10 Subreddits with maximum Closeness Centrality

Closeness centrality is based on the mean distance from one node to all the other nodes: it is the reciprocal of that mean distance, so the more central a node is, the closer it is to all the other nodes and the higher its score. We see that the node programming again has the highest closeness centrality, which means that it is, on average, closer to the other nodes than any other node is.[6]
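A hand calculation on a tiny made-up graph may make the definition concrete. For a connected graph, `nx.closeness_centrality` returns (n - 1) divided by the sum of shortest-path distances from the node to every other node.

```python
import networkx as nx

# Toy path: alice - programming - bob - technology (names invented).
toy_g = nx.Graph([
    ("alice", "programming"),
    ("bob", "programming"),
    ("bob", "technology"),
])

close = nx.closeness_centrality(toy_g)

# By hand for 'programming': distances are 1 (alice), 1 (bob), 2 (technology).
by_hand = (4 - 1) / (1 + 1 + 2)
print(close["programming"], by_hand)  # 0.75 0.75
```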

Betweenness Centrality

Betweenness centrality measures the number of shortest paths that a node lies on. It is usually used to determine the flow of information through the graph.
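The sketch below shows the idea on a made-up graph. A node scores high when many shortest paths between other nodes pass through it; by default NetworkX normalizes the count by the number of node pairs that exclude the node itself.

```python
import networkx as nx

# Toy author-subreddit graph; the user names are invented.
toy_g = nx.Graph([
    ("alice", "programming"),
    ("bob", "programming"),
    ("carol", "programming"),
    ("bob", "technology"),
])

btw = nx.betweenness_centrality(toy_g)

# 'programming' sits on 5 of the 6 shortest paths between the other nodes.
print(round(btw["programming"], 3))  # 0.833
# 'bob' is the only bridge to 'technology', so he carries 3 of 6 paths.
print(round(btw["bob"], 3))  # 0.5
```

Leaf nodes such as alice lie on no path between two other nodes, so their score is 0.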

Eigenvector Centrality

Eigenvector centrality measures a node’s relative influence in the network, or how well a node is connected to other highly connected nodes. From this, we can say that our top 3 influencers are programming, technology, and kurtstir.
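One way to obtain such a top list is to sort the nodes by their eigenvector centrality score, as in this sketch on a made-up graph. The `max_iter` value is an arbitrary safety margin for the iterative solver, not a setting taken from the article.

```python
import networkx as nx

# Toy author-subreddit graph; the user names are invented.
toy_g = nx.Graph([
    ("alice", "programming"),
    ("bob", "programming"),
    ("carol", "programming"),
    ("bob", "technology"),
])

eig = nx.eigenvector_centrality(toy_g, max_iter=1000)

# Sort node names by score, highest first.
ranking = sorted(eig, key=eig.get, reverse=True)
print(ranking[:2])  # ['programming', 'bob']
```

Note that technology ranks last here even though it is a subreddit: it has a single neighbour, while alice and carol are attached to the best-connected node.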

GRAPH VISUALIZATIONS

from matplotlib.pyplot import figure
figure(figsize=(10, 10))
nx.draw_shell(g, with_labels=True)

The figure shows how the subreddits are connected to one another. r/programming has the most connected nodes.

leaderboard = {}
for x in g.nodes:
    leaderboard[x] = len(g[x])  # number of neighbours of node x
s = pd.Series(leaderboard, name='connections')
df_conn = s.to_frame().sort_values('connections', ascending=False)

This shows the top subreddits by number of connected nodes. programming is expected to come first, since it is our main focus.

Network Graph of related subreddits of programming

# Create the graph from the dataframe [5]
g = nx.from_pandas_edgelist(nx_df, source='author', target='subreddit')
# Assumption: `subs` is not defined in the original; here it is taken to be the unique subreddit names
subs = list(nx_df['subreddit'].unique())

# Create a layout for nodes
# https://towardsdatascience.com/an-intro-to-graph-theory-centrality-measurements-and-networkx-1c2e580adf37
layout = nx.spring_layout(g, iterations=50, scale=2)

# Draw the subreddits, sized by their number of connections
sub_size = [g.degree(sub) * 80 for sub in subs]  # multiplying by 80 to get circular size
nx.draw_networkx_nodes(g,
                       layout,
                       nodelist=subs,
                       node_size=sub_size,
                       node_color='powderblue')

# Draw all the entities
nx.draw_networkx_nodes(g, layout, nodelist=authors_lst, node_color='green', node_size=100)

# Draw highly connected influencers
popular_people = [person for person in authors_lst if g.degree(person) > 1]
nx.draw_networkx_nodes(g, layout, nodelist=popular_people, node_color='orange', node_size=100)
nx.draw_networkx_edges(g, layout, width=1, edge_color="lightgreen")

node_labels = dict(zip(subs, subs))  # labels for subs
nx.draw_networkx_labels(g, layout, labels=node_labels)
plt.axis('off')
plt.title("Network Graph of Related Subreddits")
plt.show()

r/programming and its neighboring nodes, which influence its content

Influencer nodes appear in green, while the subreddits appear in light blue and are sized according to their number of connections. The influencers who have more connections than just r/programming are highlighted in orange.[5]

BUGS ENCOUNTERED

I encountered an error while looping through every “influencer” user to get the top posts for each: a 403 HTTP response error. There can be multiple reasons behind this error. Check out this link if you get the same error.

I got this error because one of the users I fetched was no longer on Reddit, possibly because the user was banned or suspended. So I had to manually check each user and remove from my data the particular user (‘SuperheroNick’) that was causing the error.

#Removing the Author(User), for which posts are not available
authors_lst.remove('SuperheroNick')
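Instead of checking each user by hand, the loop could skip users whose request fails and record them for later inspection. The following is only a sketch of that idea: `collect_available` is a hypothetical helper, and the fetch function and error type in the example are stand-ins. With PRAW, one would presumably pass `prawcore.exceptions.Forbidden` and `prawcore.exceptions.NotFound` as `skip_errors`.

```python
def collect_available(users, fetch, skip_errors=(Exception,)):
    """Call fetch(user) for each user, skipping those that raise skip_errors.

    Returns (results, skipped) so the skipped users can be inspected later.
    """
    results, skipped = [], []
    for user in users:
        try:
            results.append(fetch(user))
        except skip_errors:
            skipped.append(user)
    return results, skipped


# Stand-in for a real API call: fails for one made-up account.
def fake_fetch(user):
    if user == "suspended_user":
        raise PermissionError("403")
    return f"posts of {user}"


results, skipped = collect_available(
    ["alice", "suspended_user", "bob"], fake_fetch, skip_errors=(PermissionError,)
)
print(skipped)  # ['suspended_user']
```

Catching only the specific error types keeps genuine bugs visible rather than silently swallowing every exception.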

CONCLUSION AND LIMITATIONS

From the above network graphs and analysis, we can say that ‘technology’, ‘kurtstir’, and ‘serialk’ are among the top influencers of r/programming. Many other subreddits are connected to each other through shared users. Subreddits that are closer together on the plot are more likely to have users who post in both. The bar graph in our analysis also showed how the frequent posters of r/programming post on other subreddits: Redditors active on r/programming are also active on Linux, ProgrammerHumor, Learn programming, Webdev, technology, etc.

Our sample of the top five hundred posts in a subreddit may produce rather limited results: the data is not consistent and may be skewed, so some influencers may appear more often than they otherwise would.