Reddit Quantitative Statistical Study

ABSTRACT 

Reddit, often stylized with a lowercase r as “reddit”, is a popular website and online community that allows users to share text and hyperlinks as posts for other users to view. Users can also comment on posts in an associated reply thread. Users of Reddit are often referred to as redditors, and will be referred to as such in this study.

Reddit’s central mechanic is the ability for redditors to vote on posts and comments. One can either upvote, indicating that they enjoyed the post, or downvote, indicating that they did not like it. The number of downvotes is subtracted from the number of upvotes to give a post or comment its karma score (“About Reddit” 1). Redditors accumulate karma by posting links (text posts do not contribute to karma) and by commenting on posts. Karma is calculated by summing the upvote/downvote scores of a redditor’s posts and comments; these sums are kept separately as link-karma and comment-karma, representing the overall success of one’s posts and comments respectively.

Reddit is also divided into subreddits. Each subreddit has its own topic, theme, and rules. Redditors subscribe to subreddits for easier access, and users can also create new subreddits as long as each has a unique name. Reddit is composed entirely of subreddits, so participation and content generation happen through posts to a subreddit, not to Reddit as a whole (Steinbauer, 2).

Every online community, Reddit included, requires the continued contribution and participation of its users to remain alive. The “Reader-to-Leader framework” proposed by Preece & Shneiderman (2009), which incorporates Legitimate Peripheral Participation (LPP), can help one understand how redditors transition from lurkers into active contributors. However, each subreddit has its own norms and rules, and successful participation is difficult until those norms are known. Posting successful content is even more difficult in the large ‘default’ subreddits, which receive so many posts that very few people will see any one post.

The goal of this project is to determine whether there is a correlation between the popularity of a Reddit post, measured by the post’s karma-score, and the phrasing of the post’s title. Data about this correlation can shed light on patterns found in popular Reddit posts, which could help redditors learn which title characteristics are common to popular posts and how to formulate posts for more success.

 


BACKGROUND

The goal of many online communities is to elicit contribution from their member base in a productive and supportive manner. Many research projects have sought to depict the inner workings of specific online communities and to discover how they function and how users engage with them. Popular communities that have been analyzed include Wikipedia and Reddit. This study specifically analyzes Reddit post titles and post success, looking for a correlation between the two with the goal of helping users contribute more successfully to the community. Many other studies look for correlations within a specific network that help to bolster user input; one such study is Robert E. Kraut and Paul Resnick’s investigation into Wikipedia.

Robert E. Kraut and Paul Resnick’s paper, “Encouraging contribution to online communities,” describes many different methods for encouraging repeat contribution to communities, but the one that relates best to Reddit concerns recording feedback: “feedback and record-keeping about contributions that people make can be a powerful motivator, [it] can act as an informational reward, especially when combined with some kind of goal-setting process” (Kraut, Resnick). This idea of feedback in the form of popularity inspired the purpose of this study, which is to determine the correlation between characteristics of a post (specifically its title) and the post’s feedback in the form of upvotes and downvotes. Although this study does not provide direct feedback to specific users as Kraut and Resnick did, the conclusions drawn from analyzing the phrasing of post titles relative to each post’s success have the potential to show users what is entailed in creating a successful post. With this feedback, users will hopefully take advantage of the analysis to make more contributions.

This study has the potential to yield new information that has not been published previously. One previous study, conducted by Troy Steinbauer of University of California Santa Barbara, used a Reddit crawler to collect data from all across Reddit, including posts, post titles, subreddit information, and comments. His analysis provides new details on the overall structure of Reddit and describes the development and uses of subreddits. For example, Steinbauer found that the oldest subreddits have the most subscribers and are in general the most active. He also found overlap among many subreddits depending on the topic; for example, city subreddits overlapped with the subreddit for their state. All of this information is valuable, but it includes no analysis of the correlation between post characteristics and post success, and no comparisons looking for the specific characteristics that could make a post successful.

This study breaks down the title of each collected Reddit post and analyzes how the title is phrased using n-grams. An n-gram is a group of n consecutive words in a string. For example, “the dog jumped over the fence” contains the 1-grams “the”, “dog”, “jumped”, “over”, “the”, “fence”; the 2-grams “the dog”, “dog jumped”…; the 3-grams “the dog jumped”, “dog jumped over”…; the 4-grams “the dog jumped over”…; and so on. For each post title we generated its 1-, 2-, 3-, 4-, and 5-grams. If an n-gram appears multiple times in a title, it is only counted once. This prevents a title such as “scarlett” repeated 30 times from skewing the n-gram counts; otherwise that title would yield 26 counts of the 5-gram “scarlett scarlett scarlett scarlett scarlett”.
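The n-gram generation described above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used in the study: the function names are hypothetical, and the tokenization (lowercasing and splitting on whitespace) is an assumption, since the handling of case and punctuation is not described here.

```python
def ngrams(title, n):
    """Return the set of distinct n-grams in a title.

    Assumption: words are lowercased and split on whitespace only.
    Using a set means a repeated n-gram is counted once per title.
    """
    words = title.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def title_ngrams(title, max_n=5):
    """Map each n in 1..max_n to the distinct n-grams of the title."""
    return {n: ngrams(title, n) for n in range(1, max_n + 1)}


example = title_ngrams("the dog jumped over the fence")
print(sorted(example[1]))  # ['dog', 'fence', 'jumped', 'over', 'the']
print(sorted(example[2]))  # ['dog jumped', 'jumped over', 'over the', 'the dog', 'the fence']

# A 30-word title of one repeated word has 26 five-word windows,
# but they collapse to a single distinct 5-gram.
repeated = " ".join(["scarlett"] * 30)
print(len(ngrams(repeated, 5)))  # 1
```

Returning a set rather than a list is what enforces the "counted once per title" rule.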

N-grams can be used to classify text, and are useful in machine-learning text classifiers. As a simplified example, an algorithm might be aware that certain 2-grams are more likely to occur in a quantum-physics journal article and rarely occur in a medical journal article (likewise, some 2-grams occur frequently in medical journals but rarely in quantum-physics journals). Such an algorithm could then employ a Bayesian method to calculate the probability of an article being from either a medical journal or a quantum-physics journal.
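To make the Bayesian idea concrete, here is a toy sketch of a naive Bayes calculation over 2-grams. Everything in it is assumed for illustration: the 2-grams, their probabilities, the equal priors, and the default probability for unseen 2-grams are made-up values, not data from this study or from any real journal corpus.

```python
import math

# Hypothetical probability of seeing each 2-gram in an article of each class.
LIKELIHOODS = {
    "physics": {"wave function": 0.02, "clinical trial": 0.0005},
    "medical": {"wave function": 0.0005, "clinical trial": 0.03},
}
PRIORS = {"physics": 0.5, "medical": 0.5}  # assumed equal priors
UNSEEN = 1e-4  # assumed probability for a 2-gram missing from a class table


def posterior(bigrams):
    """Return P(class | bigrams) assuming the bigrams are independent."""
    log_scores = {}
    for label, prior in PRIORS.items():
        score = math.log(prior)
        for bigram in bigrams:
            score += math.log(LIKELIHOODS[label].get(bigram, UNSEEN))
        log_scores[label] = score
    # Normalize in log space to avoid underflow on long inputs.
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


print(posterior(["wave function"]))
```

With these made-up numbers, an article containing “wave function” is assigned mostly to the physics class, because that 2-gram is 40 times more likely under the physics table than under the medical one.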

Although the data gathered by this study has the potential to inform a machine-learning algorithm similar to the one described above, that is not this study’s purpose. However, this example of machine learning with n-grams shows how n-grams that occur in natural language are often correlated with attributes of the document or object they are associated with. For the purposes of this study, n-grams will be correlated with the karma-scores of Reddit posts.

If posts whose titles contain a certain n-gram on average receive a higher karma-score, this might suggest that post title language plays a role in the success, measured by karma-score, of a Reddit post.
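The comparison described above amounts to computing, for each n-gram, how many titles contain it and the average karma-score of those titles. The following is a minimal sketch under stated assumptions: the function name and the sample posts are hypothetical, and titles are tokenized by lowercasing and splitting on whitespace.

```python
from collections import defaultdict


def ngram_karma_stats(posts, n):
    """Return {ngram: (count, mean_karma)} for (title, karma) pairs.

    Each n-gram is counted at most once per title.
    """
    totals = defaultdict(lambda: [0, 0])  # ngram -> [title count, karma sum]
    for title, karma in posts:
        words = title.lower().split()
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        for gram in grams:
            totals[gram][0] += 1
            totals[gram][1] += karma
    return {gram: (count, total / count) for gram, (count, total) in totals.items()}


# Hypothetical sample data, not taken from the study's dataset.
posts = [
    ("What do you think", 300),
    ("What do you want", 100),
    ("Look at this", 50),
]
stats = ngram_karma_stats(posts, 3)
print(stats["what do you"])   # (2, 200.0)
print(stats["do you think"])  # (1, 300.0)
```

Keeping the count next to the mean matters when reading the output: an n-gram seen in only one or two titles can have a very high average without indicating any pattern.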

 

METHODS

This study used highly automated data collection and data analysis, facilitated by the Python programming language and the Python Reddit API Wrapper (PRAW). Python is a general-purpose, high-level programming language; it was selected because we are familiar with it and because it could be tailored to our needs by installing freely available code libraries for extended functionality. PRAW is one such code library and was essential for this study. PRAW provided a high level of abstraction, making it straightforward to collect new posts from Reddit and then review them. Several other Python scripts we created parsed the posts we collected and calculated relevant statistics, but did not use PRAW.

 

Data Collection

Data collection was broken into two parts. For primary sampling, the first part of data collection, a script using PRAW collected new Reddit posts 100 at a time, every 2 minutes. This script was run several times over the course of a week. Depending on how many iterations it was instructed to perform, it collected anywhere from 1,000 to 50,000 posts in a run. When a post was sampled, the script recorded its karma-score and its permalink.
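The primary sampling loop might look roughly like the sketch below. It is not the study's script: `fetch_new` is a hypothetical stand-in for whatever PRAW call returns the newest posts, the dictionary keys are assumed names, and skipping posts already seen in an earlier batch is an assumption not stated above.

```python
import time


def primary_sample(fetch_new, iterations, batch_size=100,
                   wait_seconds=120, sleep=time.sleep):
    """Collect `iterations` batches of new posts, pausing between batches.

    fetch_new(batch_size) is a hypothetical callable returning dicts with
    "permalink" and "score" keys. Returns {permalink: score at sampling time}.
    """
    sampled = {}
    for i in range(iterations):
        for post in fetch_new(batch_size):
            # Assumption: a post seen in an earlier batch keeps its first score.
            sampled.setdefault(post["permalink"], post["score"])
        if i < iterations - 1:
            sleep(wait_seconds)
    return sampled


# Demonstration with a fake fetcher and a no-op sleep, so nothing is
# requested from Reddit and the example finishes immediately.
fake_batches = iter([
    [{"permalink": "/r/example/1", "score": 1}],
    [{"permalink": "/r/example/1", "score": 4},
     {"permalink": "/r/example/2", "score": 0}],
])
result = primary_sample(lambda n: next(fake_batches), iterations=2,
                        sleep=lambda seconds: None)
print(result)  # {'/r/example/1': 1, '/r/example/2': 0}
```

At 100 posts per batch, the 1,000 to 50,000 posts per run mentioned above correspond to between 10 and 500 iterations.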

The second part of data collection, secondary sampling, occurred 1-2 weeks after primary sampling and was performed by another script. This script first merged all the posts collected during primary sampling into one dataset. Then, using the permalinks recorded during primary sampling, the script used PRAW to record updated and new information for each post in the dataset. This information included the post’s latest karma-score, title, comment-count, and author username. Because posts are occasionally deleted, some permalink queries made through PRAW yielded no results; in these cases, the post’s karma-score, title, and comment-count were given null values (the keyword None in Python).
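A minimal sketch of the merge and refresh steps follows. The function names, the dictionary keys, and the `lookup` callable (standing in for a PRAW query by permalink) are all hypothetical. Setting the author to None for a deleted post is also an assumption; the text above only specifies None for the karma-score, title, and comment-count.

```python
def merge_runs(runs):
    """Merge per-run {permalink: score} dicts into one dataset.

    Assumption: if a permalink appears in several runs, the later run wins.
    """
    merged = {}
    for run in runs:
        merged.update(run)
    return merged


def secondary_sample(permalinks, lookup):
    """Refresh each post; lookup(permalink) returns a dict, or None if deleted."""
    refreshed = {}
    for link in permalinks:
        post = lookup(link)
        if post is None:
            # Deleted post: record null values instead of dropping the row.
            refreshed[link] = {"score": None, "title": None,
                               "num_comments": None, "author": None}
        else:
            refreshed[link] = {key: post[key] for key in
                               ("score", "title", "num_comments", "author")}
    return refreshed


# Demonstration with a fake lookup table in place of live queries.
live = {"/r/example/1": {"score": 42, "title": "Hello", "num_comments": 3,
                         "author": "someone"}}
dataset = merge_runs([{"/r/example/1": 1}, {"/r/example/2": 0}])
print(secondary_sample(dataset, live.get))
```

Keeping deleted posts as rows of None values, rather than removing them, lets later statistics decide explicitly whether to skip them.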

Because of restrictions on Reddit’s API, PRAW waits two seconds between each request, so refreshing a large dataset takes a long time. Secondary sampling was therefore broken into a first segment, which returned a portion of the posts so that analysis could begin, and a second segment that processed the remaining posts from primary sampling.
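A rough back-of-the-envelope estimate shows why splitting the work was useful. The sketch assumes one request per post, which is not stated above, and uses the approximate total of 250,000 posts mentioned in the Conclusion.

```python
SECONDS_PER_REQUEST = 2  # delay between requests, as described above


def refresh_hours(n_posts, requests_per_post=1):
    """Estimated hours to refresh n_posts (assumes one request per post)."""
    return n_posts * requests_per_post * SECONDS_PER_REQUEST / 3600


hours = refresh_hours(250_000)
print(round(hours, 1))       # 138.9
print(round(hours / 24, 1))  # 5.8
```

Under these assumptions a full pass takes nearly six days of continuous requests, so analysis on a first segment could begin long before the remaining posts were refreshed.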

 


Data Analysis

First, we computed baseline statistics over the entire dataset. The following statistics were calculated: karma-score distributions, average karma-score, karma-score standard deviation, n-gram counts/rankings, comment-count distributions, and subreddit popularity. These statistical distributions informed the next stage of analysis, which examined segments of our sample dataset in greater detail.
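Several of the baseline statistics can be computed with Python's standard library alone. The sketch below is illustrative only: the function name and bin width are hypothetical, the population standard deviation is an assumed choice (the text does not say which form was used), and posts with a score of None are assumed to be skipped.

```python
import statistics
from collections import Counter


def baseline(scores, bin_width=100):
    """Return (mean, standard deviation, histogram) of karma-scores.

    Scores of None (deleted posts) are skipped. The histogram maps the
    lower edge of each bin to the number of posts falling in that bin.
    """
    valid = [s for s in scores if s is not None]
    mean = statistics.mean(valid)
    stdev = statistics.pstdev(valid)  # assumption: population form
    histogram = Counter((s // bin_width) * bin_width for s in valid)
    return mean, stdev, dict(sorted(histogram.items()))


# Hypothetical scores, not taken from the study's dataset.
mean, stdev, histogram = baseline([0, 50, 150, None, 400])
print(mean)             # 150
print(round(stdev, 2))  # 154.11
print(histogram)        # {0: 2, 100: 1, 400: 1}
```

The same pattern applies to comment-counts; subreddit popularity is a plain `Counter` over each post's subreddit name.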

Because of the size of the dataset we collected, testing every piece of data would take prohibitively long on the consumer-grade computers we used for analysis. Therefore only posts and n-grams above an arbitrary threshold were analyzed. We chose only to compare posts that had karma-scores above the dataset average karma-score against

This study used several techniques to collect data and present results. It relied on the Python programming language and the Python Reddit API Wrapper (PRAW) library to collect data from Reddit. The data analyzed consist of post titles (specifically the wording of a title) and post ranking. An advantage of this method is that the analysis draws only on titles and karma-scores, so individual users are not identified in the results and user privacy is not violated. Moreover, the reported measures are quantitative in nature, which limits the risk of accidentally leaking an email address, a username, or a quote attributable to a user. With this means of measuring data, anonymity and confidentiality are protected, and we follow the guidelines and recommendations set forth by the Institutional Review Board (IRB).

This study examines the correlation between the probability of a post’s success and the presence of certain words in its title. The word patterns used in searching for a correlation come from the occurrence of n-grams (an n-gram is a sequence of n words). We will assess our measurement validity with this data and, upon further research, create an operational definition for our measures.

Data collection employed Reddit’s Application Programming Interface (API) through the Python Reddit API Wrapper (PRAW). Using PRAW, we collected new posts from several popular subreddits over the course of several weeks. After this data collection, another script reviewed the saved posts and measured their success since collection.

 

Results & Discussion

The data tables yielded interesting results and patterns for the collected Reddit posts. Among post titles, the phrases with the highest average karma score in the 1- through 5-gram partitions were “filtered”, “and can”, “to the internet”, “end polluter welfare act”, and “taxpayerfunded fossil fuel research and”. Although these n-grams had the highest average karma scores, each appeared in at most 2 posts. The other high-ranking n-grams in each category showed similar results: they had very high karma scores, but the phrases were not repeated across posts.

On the other hand, some n-grams occurred far more frequently than others among the collected posts. The n-grams that appeared most often were “the”, “in the”, “what do you”, “for the first time”, and “what do you guys think”, all occurring at least 29 times with a minimum average karma score of 236. Unfortunately, “the” and “in the” are not very conclusive n-grams because these words are so common in the English language, which is why “the” occurred 6888 times and “in the” 818 times. A more interesting piece of data is which phrases ranked highest by number of appearances in the 4- and 5-gram categories. These phrases were “what do you guys think”, “how do you feel about”, “what do you think of”, “the day of the doctor”, “thought you guys might like”, “this is what happens when”, and “how do you feel about”, all with average karma scores over 200 and counts ranging from 10 to 60. These phrases show a pattern in the titles of popular Reddit posts.

Overall, no single phrase dominated the Reddit posts, so this study cannot conclude that using a specific phrase will give a post a karma score in the thousands. However, among the phrases with high counts, there is an overall pattern in the tone and perspective of the title. Five of the seven selected n-grams spoke directly to the audience by using the word “you”, although “you” itself was only the 14th-ranked 1-gram, with a count of 1,878. This data shows that although Reddit posts with a title containing the word “you” are not necessarily the highest-rated posts on Reddit, they average karma scores around 200, which indicates that they are popular overall.

There are some drawbacks with the data. On close inspection, there are overlaps between the wording of different n-grams. For example, “what do you think of” is split in three different ways in the 3-gram tables. Certain phrases in the 4- and 5-gram tables are also very similar in meaning, but the program could not detect that the phrases “what do you think of” and “how do you feel about” ask nearly identical questions. If the program could have detected the subject and purpose of phrases, the data could have yielded results about the purposes of titles rather than their exact wording. This shortcoming raises an interesting follow-up question: which subjects and purposes in Reddit titles yield the most popularity? That question would be interesting to analyze in a separate study, which would delve further into natural language processing in order to interpret the meaning of posts.

 

Conclusion

Overall, the study did not provide dramatic results. No set of phrases was overwhelmingly popular across Reddit; however, this does lead to a different set of conclusions. The phrases with the highest karma scores occurred only once (and in some cases twice). If a phrase occurs only once in an approximate total of 250,000 posts, the phrase must be highly distinctive. This suggests that if a user is looking to have an extremely successful post on Reddit, with a karma score in the thousands, the post needs to be new, interesting, and unlike anything else on the site. Although this is interesting information, the real question at hand is how to make it easier for users (particularly new users) to contribute effectively to Reddit with popular posts.

The data collected shows that, on the whole, Reddit posts that engage the audience are popular. Common phrases that use second-person pronouns such as “you” and also ask a question tended to appear most often in the subreddits that were analyzed. They did not necessarily rank among the most popular posts on Reddit, but they all had positive average karma scores above 150 points.

This study had hypothesized that there was a correlation between post title phrasing and post popularity.  

 

 

References

Eysenbach, G., & Till, J.E. 2001. Ethical issues in qualitative research on internet communities. BMJ 323.

Kraut, R.E., & Resnick, P. 2011. Encouraging contribution to online communities. Evidence-based social design: Mining the social sciences to build successful online communities. Cambridge, MA: MIT Press.

Preece, J., & Shneiderman, B. 2009. The Reader-to-leader framework: motivating technology-mediated social participation. AIS Transactions on Human-Computer Interaction (1) 1, pp. 13-32.

“PRAW: The Python Reddit Api Wrapper.” PRAW: The Python Reddit Api Wrapper - PRAW 2.1.11 Documentation. N.p., n.d. Web. 11 Dec. 2013. <https://praw.readthedocs.org/en/latest/>.

Steinbauer, Troy. “Information and Social Analysis of Reddit.” Stanford CS224W: Social and Information Network Analysis (Autumn 2011). Stanford, Oct. 2011. Web. 11 Dec. 2013. <http://snap.stanford.edu/class/cs224w-2011/proj/tbower_Finalwriteup_v1.pdf>.

“We Power Awesome Communities.” Reddit.com: About Reddit. N.p., n.d. Web. 09 Dec. 2013.

Appendix
