Jonah Peretti hired BuzzFeed's first data scientist in 2010 to predict when and how articles would go viral on the Internet. It's a hard problem. We are still thinking about this same question today, but our canvas has changed. BuzzFeed now covers news, politics, business, tech, entertainment, food, international coverage and much more, reaching over 150 million unique visitors a month.
The data science team at BuzzFeed has evolved alongside it. The team of ten data scientists (and growing!) is part of our first-rate tech group, composed of over 100 talented engineers, product owners, and designers. The scale of our data has increased sharply as well: each month we can examine almost 2 billion views of text, image, and video content created by BuzzFeed, in addition to hundreds of millions of data points from third-party sources. We play key roles in analysis, modeling, data collection, and the dissemination of insights throughout the company.
We've found ourselves tackling hard-to-solve problems, finding surprising (and sometimes entertaining) answers and, increasingly, wanting to share and talk about our work with other data scientists and media organizations. So we're starting this blog as a forum for talking about the interesting data science work being done at BuzzFeed and, we hope, for starting conversations about similar projects and the challenges we face.
To kick off, we're sharing our general framework for how we think about data science. (Note: please don't confuse the data science team with the data journalism team, although some of these tenets spill over to them as well.) Here are a few things we believe about data.
1. We anonymize all usage data.
First and foremost, we respect the privacy of data. At BuzzFeed, our policy is simple: we anonymize all usage data, maintain strict internal policies that limit employees to accessing data only in aggregate form, and are building technical safeguards that would alert us if that policy is breached.
2. Questions matter more than answers.
If you don't ask the right questions, you won't get useful answers. Yet there is an all-too-common assumption that the "standard" questions are the only ones that can be asked. Suppose an editor, wanting to know how his stories are doing, asks for a page-view report. In some organizations a separate "web analytics" team might simply pull this data and send it to him. At BuzzFeed, the data science team handles web analytics, so before pulling the data we discuss with the editor what, intuitively, he is trying to understand, and then figure out which metrics best measure that. If we're not currently gathering data for those metrics, we can take steps to start doing so. This is the simplest example. As problems get more complex, asking the right questions matters even more.
3. Be skeptical about the data.
There is a sadly pervasive belief that data = truth. If numbers are involved, it must be true, and the more numerous the numbers, the truer the truth! The cult of "big data" equates the volume of data with its trustworthiness. The reality is that every data collection scheme is a set of rules coded by humans; any experiment could hide inherent biases; every model's assumptions could be wrong. If the methodology is faulty, then it doesn't matter how much data you have. Size doesn't trump technique; both matter. Data scientists are duty-bound to question the viability of data sets, to reconsider methods of analysis, and to question the degree of "truth" that can be extracted. Only then can we get closer to understanding what is happening, which is always more complex than a single number can hold.
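To make the point about size versus technique concrete, here is a minimal simulation sketch in Python. Everything in it is hypothetical: the population of session lengths, the 30-second logging cutoff, and the function names are assumptions made for illustration, not a description of how BuzzFeed collects data. It compares a very large sample gathered under a biased collection rule with a small random sample.

```python
import random
import statistics


def simulate_population(size: int, seed: int = 0) -> list[float]:
    """Hypothetical session lengths in seconds, exponentially distributed (mean 60)."""
    rng = random.Random(seed)
    return [rng.expovariate(1 / 60) for _ in range(size)]


def biased_sample(population: list[float], cutoff: float = 30.0) -> list[float]:
    """Assumed collection rule: sessions shorter than the cutoff are never logged."""
    return [t for t in population if t >= cutoff]


def random_sample(population: list[float], n: int, seed: int = 1) -> list[float]:
    """A small sample in which every session is equally likely to be picked."""
    return random.Random(seed).sample(population, n)


population = simulate_population(100_000)
true_mean = statistics.fmean(population)

# Tens of thousands of rows, but every one passed through the biased rule.
big_biased_mean = statistics.fmean(biased_sample(population))

# Only 500 rows, but drawn without favoring any kind of session.
small_random_mean = statistics.fmean(random_sample(population, 500))

print(f"true mean:          {true_mean:.1f}s")
print(f"large biased mean:  {big_biased_mean:.1f}s")
print(f"small random mean:  {small_random_mean:.1f}s")
```

In this setup the biased sample is more than a hundred times larger than the random one, yet its mean lands further from the truth, because the collection rule silently dropped the shortest sessions.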
4. Data can tell you what happened, but rarely why.
Let's say we've asked the pertinent questions, set up the least biased experiments, and analyzed the results in the best way available. Fabulous -- we know something! While we have figured out something that happened, we shouldn't assume that we know why. We can certainly speculate, design further experiments to test hypotheses, and even ask users with surveys, but it's always unproductive (and usually counter-productive) to think we know more than we actually do.
Sometimes a lot of data will tell you what will likely happen. Predictive analysis is one of our core areas of research. But again, a predictive algorithm can work well on the strength of correlations seen in the data without our understanding the "why" of what we're trying to predict. Correlation and causation are not the same, and we need to think about when it makes sense to act on correlation.
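The gap between correlation and causation can be shown with a small simulation. This is a sketch under stated assumptions, not a result from BuzzFeed data: the variables `topic_interest`, `shares`, and `views` are hypothetical, and the model deliberately makes a hidden factor drive both observed metrics while neither one affects the other.

```python
import math
import random


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)
n = 5_000

# Hidden confounder (hypothetical): how interesting each article's topic is.
topic_interest = [rng.gauss(0, 1) for _ in range(n)]

# Observed world: interest drives both metrics; shares do NOT drive views.
shares = [z + rng.gauss(0, 0.5) for z in topic_interest]
views = [z + rng.gauss(0, 0.5) for z in topic_interest]
observed_r = pearson(shares, views)

# Experiment: assign shares at random, independent of topic interest.
# Views still depend only on interest, so the relationship disappears.
forced_shares = [rng.gauss(0, 1) for _ in range(n)]
experiment_r = pearson(forced_shares, views)

print(f"correlation in observed data:  {observed_r:.2f}")
print(f"correlation under experiment:  {experiment_r:.2f}")
```

A model trained on the observed data could predict views from shares quite well, yet in this simulated world pushing shares up would do nothing for views. Whether to act on a correlation depends on which of these situations you believe you are in.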
5. Data is only as powerful as the organization behind it.
BuzzFeed has a thriving, effective data science team because the culture of the company allows it. Some examples of how culture is critical to the success of data science:
- Both editorial and business teams are lean and experimental, so they can test data hypotheses fairly quickly. Flexibility is key: having the data is pointless if you can't act on it. So is speed: having the data is equally pointless if you can't use it before it becomes obsolete.
- We've invested in a technology infrastructure that can support data science needs: frameworks for large-scale collection and processing of data, tools and APIs for obtaining and analyzing data, ad-hoc data stores for analyses, and an A/B testing platform.
- Employees in every group and at every level are aware that data (and, more broadly, technology) is core to our success. They also know that data has limits, which leads to the next point.
6. Don’t be a slave to the data.
Data should inform your choices, not determine your strategy. In fact, over-optimization can lead to achieving only a local maximum.
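A toy example shows how greedy optimization can get stuck on a local maximum. The `score` function below is invented for illustration (a small peak at 2 and a taller one at 8 on a scale from 0 to 10), and `hill_climb` is a basic greedy search that only ever accepts a neighboring setting if it scores strictly higher.

```python
def score(x: int) -> int:
    """Hypothetical metric: a small peak at x=2 and a taller peak at x=8."""
    return max(0, 4 - (x - 2) ** 2) + max(0, 9 - (x - 8) ** 2)


def hill_climb(start: int, lo: int = 0, hi: int = 10) -> int:
    """Greedy search: move to the better neighbor until no neighbor improves."""
    x = start
    while True:
        neighbors = [n for n in (x - 1, x + 1) if lo <= n <= hi]
        best = max(neighbors, key=score)
        if score(best) <= score(x):
            return x  # no neighbor is strictly better, so we stop here
        x = best


local_peak = hill_climb(start=0)   # climbs the nearby small peak and stops
other_peak = hill_climb(start=6)   # a different starting point finds the taller peak
global_peak = max(range(0, 11), key=score)

print(local_peak, score(local_peak))
print(other_peak, score(other_peak))
print(global_peak, score(global_peak))
```

Starting from 0, each step improves the metric, and the search still ends on the lower peak. Reaching the taller one requires a jump that looks worse in the short term, which is the kind of choice that judgment and strategy supply and step-by-step optimization does not.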
7. There is no one metric to rule them all.
Recent debates about the most important or newest web metric do not distract us. Unique visitors matter, shares matter, front page visits matter, app DAUs and MAUs matter, social media followers matter, traffic source diversity matters, time spent matters, editorial judgement matters, subjective UX, design, and brand perception matter, press pick-up and moving-the-conversation matter, scoops matter, diversity of content matters, and we are probably missing a few others.
BuzzFeed is a combination of art, science, and good judgement. Understanding that balance is a competitive advantage.
8. Data is under-utilized and over-hyped.
Today, it's hard to find a media organization that isn't thinking about data science at some level. People talk about big data, small data, lean data, smart data. We try not to get caught up in the labeling and focus instead on the problems we're solving. The only way for media organizations to get the most out of data science, to climb out of the trough of disillusionment in Gartner's hype cycle, is to keep questioning, collecting, scrubbing, learning, analyzing, testing, making mistakes, and doing it again.
9. Data is fun!
We are excited about how much more we can still learn, about models that we are still building, about experiments and features that we are still rolling out. Every project is different. We work with people from all departments of the company and with outside researchers too.
Join us in the conversation about how we and other media organizations use data. Our first post is a look at how sharing and reading are correlated on BuzzFeed. And, if you'd like to find yourself at the intersection of Drew Conway's Venn diagram, then join our team!