CONTAGION—How Twitter got so hot in academic research

This article is part of our Contagion package, a series that explores the science of how things spread.

A couple years ago, Sherry Emery, a health economist at the University of Illinois at Chicago, found herself reading tweets about “smoking hot girls.” Also about “smoking ribs,” “smoking weed,” and the “smoking chimney” of the papal conclave. If she got lucky, they’d be about “smoking squares” or just “smoking,” in an easily decoded context that referred to cigarettes.

Emery has studied the impact of tobacco-related advertising for years. Until recently, that meant looking at TV and radio spots, tracking Nielsen Ratings and regional smoking rates. But then, one night watching Netflix in 2011, she had a thought: if she was on the web, so were many others—and they were likely leaving a trail of their attitudes towards smoking on social media platforms such as Twitter.

In September 2011, the National Cancer Institute awarded her a $7.2 million grant to look into it—and so she went, a pioneer (in her line of work) into the brave new world of Twitterology.

She’s hardly alone these days. Since Twitter was founded in 2006, academics have flocked to the micro-blogging platform—not to tweet messages (though some do that too), but to study them. With 225 million users issuing half a billion tweets per day, Twitter represents the richest dataset to hit academia... well, maybe ever—a virtual Petri dish of real-time data, attractive to scholars of all disciplines, for studies of all sorts. Physicists have used Twitter to study networks; psychologists to study narcissism; linguists to study regional language variation. There are research papers about what can be learned by using Twitter to track dental pain, air quality and public concern about flu outbreaks—as well as studies on Twitter’s potential to predict the outcome of NFL games, diagnose post-traumatic stress disorder, and measure worldwide happiness. In all, some 2,000 journal articles and 3,000 conference papers have been written about Twitter (or have at least contained the word in their title, keywords or abstract), according to Scopus, a database of academic publications. There’s even a paper, published in 2013 in the Journal of Documentation, titled “What do people study when they study Twitter? Classifying Twitter related academic papers.”

The social networking site is not the most likely of tools to have caught fire in the Ivory Tower. How did Twitter, a site that traffics in 140-character-or-less messages and that counts two pop stars—Katy Perry (with 55.6 million followers) and Justin Bieber (with 53.6 million)—as its most influential users, become so hot among the academic set?

In this series on contagion, my FORTUNE colleagues and I set out to explore how things spread—from M&A rumors, to market panics, to the ‘selfie’. And for the final installment of this series, we decided to get especially meta. After all, how better to probe the anatomy of a social epidemic than to track how Twitter, one of the preferred tools for studying contagion these days, got so contagious among people studying it?

The story begins in the not-too-distant past with computer scientists. Even more than most academics, computer scientists need data—and for years, they’ve mined whatever odd and interesting datasets have come their way. The Enron emails, for example (some 600,000 messages belonging to 158 Enron employees, made public by the Federal Energy Regulatory Commission after its investigation of the company), became popular fodder in the field after they were released in 2003.

Social media may seem an obvious next frontier for data-minded academics, but when computer scientist Jennifer Golbeck first started studying such platforms in 2003 (she was inspired by MySpace), it was not considered particularly promising or serious work. Colleagues in her highly technical field dismissed it as “social science”; and in the nascent universe of online social networks, the largest was a hook-up site with a community of 20 million members called AdultFriendFinder.

Golbeck, a Ph.D. student at the time, saw greater potential in such platforms: “There was so much interesting computing to be done,” she says. But she was still battling to convince computer science departments of this when she completed her degree in 2005.

Now a professor at the University of Maryland, College Park, Golbeck heads up the school’s Human-Computer Interaction Lab and continues to study what can be learned about humans and relationships using social media. Her prolific output has included papers on “the sense and structure of community on YouTube,” how Congressional representatives use Twitter, and the dynamics of the human-pet relationship (many platforms). That work makes her much in demand—her TED talk, “The Curly Fries Conundrum: Why social media likes say more than you might think,” has been viewed 1.2 million times since October 2013.

Eytan Adar, now an assistant professor of information and computer science at the University of Michigan, was another pioneer. Years ago he used blogs to study how memes spread and, in 2007, he co-founded the International Conference on Weblogs and Social Media in an effort to build community among researchers doing similar work. That year, the event drew 145 people, offered talks like “Building Trust on Corporate Blogs” and “Social Browsing on Flickr,” and featured Ev Williams, the founder of a then-fledgling start-up called Twitter, as the keynote speaker. (Like Twitter, the conference has grown a lot since then.)

The first academics to study Twitter tended to be computer scientists like Golbeck and Adar, who had both the savvy to understand Twitter and the tech skills to collect and manipulate its data, as well as physicists and information science and communications scholars who were particularly interested in network effects. Research from those early years tended to focus on Twitter itself—statistical analyses of how and for what the service was used. Then came more sophisticated studies focused on the mechanics of Twitter: the study of things like “unfollow dynamics,” “transient crowd discovery,” or “patterns in Twitter intra-topic user and message clustering.” Later to the party were social scientists, like Emery, who dreamt up applications for the data—predicting the outcome of elections, for instance, or elucidating the narcissism of Twitter’s college-aged users—but tended to be less technically adept at collecting and manipulating it. (As a result, a number of interdisciplinary research efforts—like those that take place in Golbeck’s lab—have sprung up.)

According to the study “What do people study when they study Twitter?” the number of Twitter-focused papers grew from 3 in 2007, to 8 in 2008, to 36 in 2009, and is up considerably since then.

“Early adopters in the social sciences of research data from Twitter were just mocked,” says Stuart Shulman, the CEO of Texifter, a developer of text analysis tools and a vendor of Twitter data that often licenses it to academics. Seasoned academics tended to be incredulous towards these (mostly) younger colleagues, he says. “Why would you do that? You can’t get tenure using that? Now there’s a whole generation coming through grad school that are going to write their master’s theses about social data.”

These days, becoming a doctor of social data looks like a secure line of work. Just as the number of papers based on Twitter research has soared, so has the number of conferences inviting academics to submit their findings. Indeed, Adar’s International Conference on Weblogs and Social Media now competes with a number of rival meetings.

What has made Twitter so popular with academics, though, isn’t just that it’s an enormous public dataset; it’s that it’s an enormous public dataset with a time scale—capturing thoughts from millions of people on all manner of subjects, recorded at a specific time (and, in some cases, in a specific place). You might think there’d be limitations to the things people would say, or tweet, on a public stage—okay, scratch that: we all know better. You might think there’d be virtually no limitations to the things people would say, or tweet, on a public stage, and you’d be right: folks on Twitter are so unfiltered, in fact, that health researchers are using the platform to track food-poisoning outbreaks. (Take a moment to figure that one out...)

Such properties set Twitter apart from other data-rich social networking sites. Facebook, for example, has privacy issues and rolls out content not chronologically but according to the funky algorithm of its News Feed.

That’s not to say academic research with Twitter is particularly easy. While Twitter is a public platform, only a fraction of its data, or 1% of the Twitter stream—Twitter calls it the “spritzer”—is free and accessible to the public through Twitter’s application programming interface (API). Some select partners—some of whom are academics—have negotiated slightly more robust access via Twitter’s “garden hose” (10% of the stream). Complete access, via the Twitter firehose or even unlimited access to particular search queries, is costly and can be obtained only through a handful of vendors. (While the Library of Congress warehouses the whole Twitter archive, it does not have the capacity to address the many data requests it receives.)

Twitter, to much excitement and fanfare, announced a data grant program earlier this year to help academics shoulder the costs of such research. In truth, the company barely opened the spigot: of 1,300 applicants, just six, or about 0.5%, were awarded grants. Texifter is now making similar grants to a total of 36 research teams.

Academics using the platform for research are certainly getting better at it. Data filtering techniques are getting more precise and sophisticated. Meanwhile, scholars are learning what sort of research Twitter is good for. Adar says the platform’s data is best for understanding what’s going on in a particular place at a particular instant; it’s a less proven (yet more highly sought-after) tool for prediction.

There also remain concerns about just how representative the Twitter data sample is. As one scholar who dabbles in Twitter research told me, it’s hard to know how much you’re watching human behavior versus how much you’re watching human behavior on Twitter.

“Maybe it’s a fad, and maybe we’ll determine that studying the five million active users of Twitter, and talking about the whole world, is really kind of stupid,” says Texifter’s Shulman. “But I don’t think so. You’d be an idiot if you said Twitter doesn’t matter.”

Or maybe Twitter does matter... but it’s still a fad. Adar has already seen signs that the platform isn’t as hot among academics as it used to be. “There’s still a lot of research on Twitter,” he says. “But some attention has shifted to other social media. When too many people are studying one thing, we have to move on to, you know, try to make novel contributions.”