Recently I was on a date. I knew we were both into the outdoors -- she likes slacklining, I like rock climbing, and there's a big overlap between the slacklining community and the rock climbing community. We're generally both into backpacking, cooking, yoga, and meditation. There was plenty of common ground.
It was going great. We chatted about our recent outdoor excursions; about my cycling trip in rural North Carolina during the last big cicada emergence; about her new career direction and the motivation for it; about our beer preferences. Eventually, the conversation turned to my other big passion: math and data science. Data science isn't for everyone. While there is some overlap between people who are into data science and, say, people who are into yoga, the overlap is small.
Needless to say, this is where the level of engagement started to drop. We returned to our common interests, but ended up parting ways with a vague sense that we had some things in common, but probably not enough.
We all have a qualitative sense that the subject "yoga" is far away from the subject "math". I'd like to try to make this idea of distance rigorous. Can we put a number on it? If we can, we can measure which topics overlap the best with other topics. We could say, quantitatively, which subjects someone who likes yoga will probably also like, and which they won't. You could even say, with slightly more structure, which subjects were "between" different subjects you like.
Since this is a data-driven post, I should probably say something about data and privacy. All of our traffic data is anonymized. Even in the anonymized form, this analysis doesn't require aggregating any particular user's viewing history. All of the following analysis is based on population-level co-viewing data. It's impossible to work back from these results to individual-level data, as information is lost in the aggregation steps.
We need a way of measuring similarity that goes beyond our intuition. The idea is that the measure can't just be a matter of opinion. We need something rigorous and objective. While the choice of which number to measure may always be debatable, the value of that number should not be.
A behavioral measure of similarity would be "the number of people who like reading about Kim Kardashian who also like reading about math". We could take the number of unique people reading posts about Kim Kardashian, the number of unique people reading posts about math, and calculate the fraction of people they have in common (this is called the Jaccard index; there's a nice wiki article here, for the technically inclined). That gives us a number we can put on it. We expect Kim Kardashian posts will tend to have high overlaps with each other. We expect any Kim Kardashian post will tend to have low overlap with math posts.
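As a minimal sketch, the Jaccard index is just the size of the audience intersection divided by the size of the audience union. The reader IDs and topic names here are invented for illustration:

```python
def jaccard_index(readers_a, readers_b):
    """Fraction of the combined audience shared by both topics: |A & B| / |A | B|."""
    a, b = set(readers_a), set(readers_b)
    if not a and not b:
        return 0.0  # no readers at all: define overlap as zero
    return len(a & b) / len(a | b)

# Hypothetical anonymized reader IDs for two topics.
kardashian_readers = {1, 2, 3, 4, 5}
math_readers = {4, 5, 6, 7, 8, 9, 10}

# 2 shared readers out of 10 distinct readers total -> 0.2
print(jaccard_index(kardashian_readers, math_readers))
```

The index runs from 0 (disjoint audiences) to 1 (identical audiences), which is what lets us treat it as a closeness score in the next step.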
We can think of this similarity as measuring how "close" two posts are. Posts with small audience overlaps are far apart. Posts with large audience overlap are close together. It fits with the intuitive idea of "closeness".
We now have a way of comparing pairs of posts, but we haven't really gotten into how to compare the whole universe of posts with each other. Is there some way we can imagine traversing the connections between pairs of posts? Or finding some ordering for posts among each other?
Check out the picture below. We represent each post with a dot. We draw a line between each pair of posts with non-zero audience overlap. We let the thickness of the line be given by the strength of the overlap. We let the points all repel each other, but they're attracted to each other based on the thickness of the lines connecting them. So what does this picture mean? Let's get into some details before returning to the main question.
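The layout described above is a standard force-directed layout. A minimal sketch using networkx's `spring_layout` on a toy overlap graph (the post names and overlap weights are invented):

```python
import networkx as nx

# Toy graph: nodes are posts, edge weights are audience-overlap (Jaccard) scores.
G = nx.Graph()
G.add_weighted_edges_from([
    ("post_a", "post_b", 0.8),   # strong overlap -> pulled close together
    ("post_b", "post_c", 0.7),
    ("post_c", "post_d", 0.05),  # weak overlap -> ends up far apart
])

# spring_layout treats weights as attraction strengths; all nodes repel each other.
pos = nx.spring_layout(G, weight="weight", seed=42)
for node, (x, y) in sorted(pos.items()):
    print(node, round(x, 2), round(y, 2))
```

The exact coordinates depend on the random seed; what matters is that strongly overlapping posts land near each other, which is the visual effect the picture relies on.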
First, a few technical details: it's easier to analyze and really get the feel for smaller data sets. For that reason, I restricted the analysis to editorial content from one vertical: 480 posts from the "Tech" vertical on buzzfeed.com, published between 1/1/2014 and 10/10/2014.
Now let's get into the fun part: let's zoom in on a little knot of posts at the bottom...
These posts form a knot because there are strong overlaps between the audiences of each pair of posts. Check out the titles of these posts:
They're all about the same subject! They're close to each other because the same people tend to like all of them! That's pretty cool, and it fits with our intuition. These posts that are close together in the picture fit with our idea of what it should mean for them to be "close" to each other.
Now let's check out a few near the top.
What are these posts?
Notice that the whole picture is in a weird sort of "C" shape, with a few posts in the middle of the "C" holding the whole thing together. If we removed those posts, our concept of distance would become much clearer. The "C" would unfold, and we'd get a picture where the posts are separated by some number of jumps across the lines between them.
These posts in the middle of the "C" seem to bridge the gaps between the different audience groups. They form a connection between otherwise unconnected groups. So what interest do Oculus Rift enthusiasts (maroon, just below the middle in the center of the top picture) and Kardashian fanatics have in common? Let's check them out:
So the audience who is interested in these posts contains members whose interests aren't far from most of the rest of the content! It makes sense: you probably care about your internet privacy, whether you're into Kim Kardashian, the latest new Xbox games, or you're looking for tech business news.
What are the colors about?
You may have noticed that the knots tend to be made up of all one color. You also might have noticed that those posts that fall in between everything share edges with multiple colors.
I've taken the whole set of comparisons, and tried to find the best groupings of posts such that there are a lot of dense connections within groups, and fewer connections between groups. This is a way to try to group posts into highly connected "modules". The hope is that the modules characterize the content belonging to them. So far, it looks like they do a decent job (the algorithm is approximate, and is detailed here). Some groups are clearly better than others. The Flappy Bird group is clearly more modular than the diffuse purple group to the left of it.
This approach turns out to be analogous to principal components analysis! You reduce from dimensions equal to the number of posts, to dimensions equal to the number of clusters. There are some more details, with some comments on PCA and modularity optimization here. A great first paper to get the intuition for modularity is here. There are, however, some caveats to using modularity, as detailed here and here. If you're interested in discussing, check out our math meetup!
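As a sketch of modularity-based community detection (networkx's greedy modularity optimizer, not necessarily the exact algorithm used here), on a toy graph of two dense clusters joined by a single bridge edge:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy overlap graph: two triangles (dense clusters) joined by one bridge edge.
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),  # cluster A
    ("b1", "b2"), ("b1", "b3"), ("b2", "b3"),  # cluster B
    ("a3", "b1"),                              # the bridge
])

# Greedily merge communities to maximize modularity.
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```

The optimizer recovers the two triangles as separate modules, because cutting the single bridge edge costs little modularity compared to keeping each dense cluster intact.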
I should also mention that if our goal from the beginning is accounting for variance in some target variable, then we should use a supervised clustering method to do that, more in analogy with partial least squares (page 79). There are even methods for finding optimal clusterings relative to some diffusion process.
While we didn't make these modules with the aim of predicting anything, it makes sense to try to use them as features for predicting other quantities, like click rates on the content, or maybe the viral lift (the ratio of total referrals to "seed", or promotion-based, referrals). The first question to ask is "Do the groupings actually account for variation in the quantity of interest?". There are a couple of ways to frame this question, but we'll stick with a simple one: is the average of the quantity of interest significantly different within groups from the overall average?
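A quick sketch of the lift definition above, with invented referral counts:

```python
# Viral lift: ratio of total referrals to "seed" (promotion-driven) referrals.
seed_referrals = 1000    # hypothetical promotion-driven views
viral_referrals = 2500   # hypothetical share-driven views

lift = (seed_referrals + viral_referrals) / seed_referrals
print(lift)  # 3.5: each promoted view generated 2.5 additional organic views
```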
I took the content in each group, found their CTR and lift, and plotted the points corresponding to each post. Then, I faded the (messy) data at the post level into the background. I plotted the average for each group in the foreground, and gave it error bars indicating standard error in the mean.
If the grouping is random, you'd expect the error bars to more-or-less overlap. That happens in the lift dimension, telling us this grouping isn't great for explaining variation in lift. In the CTR direction, on the other hand, we see some pretty good separation.
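The per-group summary described above is just a mean with a standard error. A sketch with made-up CTR numbers for two hypothetical groups:

```python
import math
import statistics as stats

# Hypothetical per-post CTRs for two groups of posts.
groups = {
    "kardashian": [0.051, 0.048, 0.055, 0.060, 0.047],
    "gadgets":    [0.021, 0.025, 0.019, 0.023, 0.022],
}

for name, ctrs in groups.items():
    mean = stats.mean(ctrs)
    # Standard error of the mean: sample std dev / sqrt(n).
    sem = stats.stdev(ctrs) / math.sqrt(len(ctrs))
    print(f"{name}: mean CTR = {mean:.4f} +/- {sem:.4f}")
```

Non-overlapping error bars between groups are the visual hint that the grouping is explaining real variation in CTR rather than noise.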
That top right group is the Kim Kardashian group! It tends to click and share well. The p value for it being different from the average is p = 0.00019, which is significant accounting for multiple testing.
In total, 5 of the 12 groups were significantly different from the average CTR. That doesn't imply the others were bad groupings -- just that either they don't explain CTR variation, or we need more data to tell.
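A sketch of the multiple-testing adjustment, assuming a Bonferroni correction over the 12 groups (the post doesn't specify which correction was actually used):

```python
# Bonferroni: with 12 simultaneous tests, compare each p-value
# against alpha / 12 instead of alpha itself.
alpha = 0.05
n_tests = 12
threshold = alpha / n_tests  # ~0.00417

p_value = 0.00019  # the Kardashian group's p-value from the text
print(p_value < threshold)  # still significant after correction
```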
So what about this idea of distance? The farthest-apart posts have a shortest path of 9 jumps. That's pretty far! The average shortest path between all pairs of posts is 4.12 jumps.
We saw how closeness makes sense, but what about far-ness? That's a hard one to examine: a post with no overlap with the others might just be a low-traffic post. To get a decent approximate answer, I restricted to the top 100 most connected posts, and checked which are the farthest apart among those (actually, using jumps weighted by inverse Jaccard index, instead of raw jump counts). They were:
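The jump-based distance above can be sketched with networkx shortest paths, using 1/Jaccard as the edge length so that weak overlaps count as long jumps (toy posts and weights, not the real data):

```python
import networkx as nx

# Toy graph: store "dist" = 1 / jaccard on each edge, so low overlap = long edge.
G = nx.Graph()
overlaps = [("a", "b", 0.5), ("b", "c", 0.25), ("a", "c", 0.05)]
for u, v, jac in overlaps:
    G.add_edge(u, v, dist=1.0 / jac)

# Unweighted: a -> c is a single direct jump.
print(nx.shortest_path(G, "a", "c"))                 # ['a', 'c']
# Weighted by inverse Jaccard: via b costs 2 + 4 = 6, beating the direct 20.
print(nx.shortest_path(G, "a", "c", weight="dist"))  # ['a', 'b', 'c']
```

This is why the weighted version gives a better notion of far-ness: two posts joined only by a near-zero overlap are effectively far apart, even if a single edge technically connects them.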
I'm not really sure what the interpretation is there, other than "If your date is talking about bad Sonic fan art, don't talk about internet behavior patterns!" -- this post probably qualifies.
If you're curious about other reasons we care about post similarity, check out this paper for a simple introduction to recommendation systems!