Journal of Communication ISSN 0021-9916

AFTERWORD

Big Data in Communication Research: Its
Contents and Discontents
Malcolm R. Parks
Department of Communication, University of Washington, Seattle, WA, 98195, USA

doi:10.1111/jcom.12090

I had two goals in mind when I decided to dedicate a special issue of the Journal of
Communication to “Big Data.” One was to provide an outlet for the growing number
of excellent Big Data studies on mass communication, digital technologies, political
communication, health communication, and many other areas of interest to our discipline. My focus was on empirical papers that made substantive contributions using
new methods, rather than on explanations, endorsements, or critiques of the Big Data
movement. The goal was to showcase the state of the art in recent research in computational communication science.
My second goal was to provide a benchmark for research innovation. Big Data
research is still in its infancy in communication. Relatively little of the work done in
this early stage will stand the test of time, but all of it will likely be critical in the ongoing
process of conceptual and methodological advance. The articles featured in this
issue represent the best of what is currently being done. Their strengths will guide
future work, but so, too, will their limitations.
What is Big Data?

There is no one definition of Big Data. Thought about in simple terms, Big Data
involves datasets that are far larger than those traditionally examined in journals like
this one. Yet there has always been considerable variation in the size of datasets, ranging from small experimental studies to large samples involving census or polling data.
Size alone is therefore an insufficient descriptor. In more substantive terms, the Big
Data movement has been associated with the analysis of large social networks (including online networks such as Twitter), automated data aggregation and mining, web
and mobile analytics, visualization of large datasets, sentiment analysis/opinion mining, machine learning, natural language processing, and computer-assisted content
analysis of very large datasets. Several of these methods are featured in this issue.

Corresponding author: Malcolm R. Parks; e-mail: [email protected]
Journal of Communication 64 (2014) 355–360 © 2014 International Communication Association

As others have observed, the Big Data movement often brings along an ideology or
mythology that asserts a special, transformative value (e.g., boyd & Crawford, 2012).
In order to evaluate these claims, it is necessary to distinguish the true promise of Big
Data from some of the overpromises of its most fervent proponents.
Separating promise from pose

Big Data methods and sources will become increasingly important because they offer
data and insights that could not be obtained in other ways. These methods open
research to work involving datasets of previously unimagined size. Indeed, they often
provide the only means of managing and analyzing digital datasets of increasing size
and complexity. The entry by Baek, Park, and Cha, for instance, begins with a scan of
approximately 1.7 billion tweets. Even after the most relevant data are selected by these
and the other authors represented in this issue, the sample sizes typically remain in
the hundreds of thousands. The ultimate value of Big Data, however, derives not from
sheer size, but rather from two other factors.
First, because the Big Data movement is coupled with what is sometimes called
“datafication,” that is, the creation of quantitative datasets from information that has
not been viewed as data in the past (Mayer-Schönberger & Cukier, 2013), it leads to
new research questions and new ways of thinking about existing questions. Among
the many examples is the relatively large social network that Christakis and Fowler
(2007, 2009) constructed from previously overlooked participant tracking information in the long-running Framingham heart study. In this issue, we might think of
Giglietto and Selva’s creative analysis of messages tweeted by television viewers as an
example of rarely examined discourse. We might also point to Hill and Shaw’s substantive appropriation of administrative data in wikis.
Big Data can open new doors in a second way as well. Its computational tools
enhance researchers’ ability to bring together multiple datasets: datasets of different types, from different places, or gathered at different times. This ability has always
existed on a small scale, but new data management and analytic capabilities make
it possible to conduct research of unprecedented complexity and scope. Several of
the studies here have done just that. One of the more striking examples is Jungherr’s
analysis combining Twitter content, separate content analyses of print and television
coverage, and public opinion polling related to the 2009 federal elections in Germany.
Together, datafication (i.e., the construction and sharing of multifaceted datasets) and
the development of new analytic tools to work on them hold dramatic promise for our
discipline.
In order to realize this promise, however, it is necessary to place Big Data in a
larger intellectual and disciplinary context. This requires looking beyond much of the
hyperbole about the “Big Data Revolution.” Among the most extreme claims is the
assertion that Big Data will render science itself obsolete, or at least no longer in need
of theory, models, or interpretation. “With enough data, the numbers speak for themselves” (Anderson, 2008). Others claim that simple correlations will be sufficient in the
Age of Big Data, that hypothesis testing and causal analysis will no longer be necessary to advance science (Mayer-Schönberger & Cukier, 2013). It is fair to say that such
positions are intended to be provocative, often in service of the authors’ market interests. A more realistic view might be to acknowledge the value of large-scale datasets,
while at the same time recognizing that the choice of data (even Big Data) always
reflects at least an implicit theoretic model and that the desire for explanation will
continue to lead scientists toward causal analysis and experimentation (even though
some experiments may now become very large).
A more subtle, but still misleading view of Big Data is that it presents a sharp
break from the past or possibly even a new science. The term “data science” is particularly unfortunate in this regard, both because of its redundancy, and because of
the way it obscures the fact that Big Data’s value ultimately depends on disciplinary
and interdisciplinary utility. Kuhn’s (1962) observation that substantive advances and
methodological advances are more often intertwined than independent is no less true
today than it was 50 years ago. This suggests that the impact of “data science” specialists will depend on their ability to create value for those engaged with substantive
disciplinary and interdisciplinary issues.

Big Data is not so much a break from the past as simply the latest in a more or
less steady flow of methodological advances that have transformed the social sciences
over the past 100 years. These include the codification of experimental design, the
development of systematic sampling and surveys, the advent of multivariate statistical analysis, the development of searchable compilations of media content, and video
recording, to name just a few. We might also keep in mind that perceptions of bigness
are themselves relative and historically bound. Several of the innovations mentioned
above were the big data revolutions of their day.
Making the most of Big Data

Placing the Big Data movement in disciplinary and historical context enables us to
attend to the issues that must be addressed if progress is to be made. Four issues would
benefit from greater attention in my view.
Greater attention to questions of theoretic and social importance

One might imagine three stages in the adoption of new research methods. Studies
done during the initial stage emphasize the methods themselves. Many are essentially
demonstration projects. Much of the current Big Data work in the social sciences,
including communication, is still at this first stage. Next, investigators begin to apply
new methods to smaller problems or well-established findings. Many of the findings
will essentially replicate previous work or address questions of secondary importance.
These studies may be useful substantively and provide guides for those working in
more central areas. Yet they will often be limited because they rely on the data
that are available rather than on the data that are needed. Finally, new methods move
into the mainstream as investigators begin to apply them to theoretically and socially
important problems.
We selected manuscripts for this issue with this third stage in mind. Although
the chosen studies vary, each clearly grapples with an issue of interest within our
research community. Studies by Jungherr, by Neuman and colleagues, and by Vargo
and colleagues bring new approaches to understanding central questions regarding
the nature and timing of influence between online social media and more traditional
media. Colleoni and colleagues examine the theoretically important question of
whether the structure of interaction on Twitter brings users into contact with diverse
perspectives or merely creates an “echo chamber” of like-minded voices. Emery
and her colleagues open a new window for considering the theoretically and socially
important issue of how public health campaigns work.
Advancing toward this higher stage will inevitably bring changes in patterns of
graduate education and collaboration. Just as media and communication researchers
in the 1970s sought training in multivariate analysis from those outside the discipline, we now reach out to those with the computational skills. But we need not go
with hats in hand. It is clear that we have much to offer in terms of substance, substance often lacking in the demonstration projects so often found in computationally
oriented work. Our contribution becomes even more critical when research sponsors
begin to demand that the makers of new tools demonstrate their societal value.
Greater concern for validity of measurement

In many of the submissions we received, researchers selected the large-scale indicators
they could and were then left in the position of trying to attribute broader conceptual meaning or importance to operational indicators of convenience rather than of
choice. Even more difficult problems arise when a given operational indicator appears
to be valid, but is too limited to capture the full richness of the concept it presumably
measures.
Progress depends as well on providing stronger evidence to support the validity of automated coding systems, machine learning algorithms, sentiment analysis,
and the other new tools rapidly entering the research sphere. The paper by Emery
and colleagues offers a good example of what is necessary to validate machine-coding
procedures. Other papers, including many of those we turned away, either relied on
coding validation procedures that were not tailored to the specific research situation
or the authors simply assumed that previous, often very limited, validation efforts
were sufficient. Here we must guard against the error of equating very detailed technical descriptions of procedures with evidence of validity. Very detailed procedures
and algorithms are not necessarily any more valid than more straightforward ones.
Indeed, because more assumptions are made, there is more to go wrong.
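To make concrete what evidence of validity for an automated coding procedure might look like, the sketch below compares machine-assigned codes with codes assigned by human coders to the same items. It is only an illustration under stated assumptions: the function name `validation_report`, the labels, and the eight example items are hypothetical and are not drawn from any study in this issue. It reports precision, recall, and Cohen's kappa (agreement corrected for chance), one common set of checks among several that could be used.

```python
from collections import Counter


def validation_report(human_codes, machine_codes, positive="positive"):
    """Compare machine-assigned codes with human codes for the same items.

    Hypothetical helper for illustration; labels may be any hashable values.
    """
    if len(human_codes) != len(machine_codes) or not human_codes:
        raise ValueError("Both code lists must be non-empty and of equal length.")

    pairs = list(zip(human_codes, machine_codes))
    n = len(pairs)

    # Treat the human codes as the reference standard.
    true_pos = sum(1 for h, m in pairs if h == positive and m == positive)
    false_pos = sum(1 for h, m in pairs if h != positive and m == positive)
    false_neg = sum(1 for h, m in pairs if h == positive and m != positive)

    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    observed = sum(1 for h, m in pairs if h == m) / n
    human_freq = Counter(human_codes)
    machine_freq = Counter(machine_codes)
    expected = sum(human_freq[c] * machine_freq[c] for c in human_freq) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0

    return {"precision": precision, "recall": recall, "kappa": kappa}


# Invented example: eight items coded by a human and by a classifier.
human = ["positive", "positive", "positive", "negative",
         "negative", "negative", "positive", "negative"]
machine = ["positive", "positive", "negative", "negative",
           "negative", "positive", "positive", "negative"]

report = validation_report(human, machine)
print(report)
```

In practice the human-coded subsample would need to be drawn from the same material the classifier is applied to, since validation figures obtained on a different corpus say little about the research situation at hand.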
Greater attention to sampling and representativeness

Big Data is not complete data. This can be seen in the articles in this issue. In nearly
every case investigators have started with a dataset that represented only a portion of
the sample universe of interest and have then focused on a still smaller portion of the
sample universe. The article by Giglietto and Selva provides us with an illuminating
example. Their dataset of tweets (N = 2.49 million) related to political talk shows for
the 2012/2013 season is described as complete. Upon closer inspection, however, it
is apparent that the dataset only contains tweets that included official or the most
popular hashtags for the programs of interest. As Jungherr notes in his article, the
choice to sample Twitter messages using hashtags may slant the sample toward more
experienced users. Giglietto and Selva based their final analyses on a much smaller
dataset intended to reflect tweets during peaks of activity. This is not intended to be
critical and indeed, to their credit, the authors are quite candid about the limitations
of the final dataset. The larger point is that even very large datasets often represent
samples whose generalizability and representativeness are open to challenge. Bigness
does not ensure quality.
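The effect of hashtag-based selection can be illustrated with a small sketch. Assuming a researcher has a hand-identified set of tweets known to be relevant to a program, the fraction of those tweets that a hashtag filter would actually capture gives a rough sense of coverage. The function names, the tracked hashtag, and the four example tweets below are all hypothetical; this is not the procedure used by any of the authors in this issue.

```python
import re


def extract_hashtags(text):
    """Return the set of hashtags in a tweet, lowercased and without '#'."""
    return {tag.lower() for tag in re.findall(r"#(\w+)", text)}


def hashtag_coverage(relevant_tweets, tracked_hashtags):
    """Share of known-relevant tweets that a hashtag filter would retain."""
    if not relevant_tweets:
        raise ValueError("Need at least one relevant tweet to estimate coverage.")
    tracked = {tag.lower().lstrip("#") for tag in tracked_hashtags}
    captured = [t for t in relevant_tweets if extract_hashtags(t) & tracked]
    return len(captured) / len(relevant_tweets)


# Invented tweets, all about the same (hypothetical) talk show.
relevant = [
    "Watching the debate tonight #TalkShow",
    "that host is so biased",
    "#talkshow great guest this week",
    "Loved the interview on channel 5 #tv",
]

coverage = hashtag_coverage(relevant, ["#talkshow"])
print(f"Share of relevant tweets captured by the hashtag filter: {coverage:.2f}")
```

A low coverage figure on such a check would suggest that the hashtag sample omits much of the relevant conversation, and possibly a different kind of user, however large the resulting dataset may be.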
It is striking that seven of the eight papers selected for this issue rely entirely or
in part on Twitter data. Although Twitter users in the United States increasingly mirror its online population in basic demographic terms (Brenner & Smith, 2013), we
know much less about the demographics of Twitter users in most other countries,
particularly those in the developing world. As Baek and his colleagues acknowledge,
this leaves cross-cultural comparisons of Twitter use and content open to concerns of
sampling bias. Beyond this, however, there is no reason to assume that molar demographic similarities between Twitter users and the overall online population imply
similarities in attitudes, issues discussed, or several of the other more specific issues
addressed in this issue.
In addition to concerns about how representative Twitter users are, we should also
be concerned about Twitter’s ability to represent social media platforms more generally. It is an appropriate choice on substantive grounds in some cases, but not in others,
or at least not as a sole choice. Twitter was an excellent choice for Giglietto and Selva’s
analysis of “second screen” interaction, though one might acknowledge that television viewers also interact with one another via direct texts, e-mail, and cellphone. As
digital venues proliferate, it will become increasingly important to analyze more than
one medium, just as those interested in media coverage of issues more generally now
are encouraged to consider both broadcast and print media. The study by Neuman,
Guggenheim, Jang, and Bae offers an outstanding example of an analysis using multiple traditional and social media. In some other cases, it is fair to ask if Twitter data
were representative of the larger, more diverse media streams substantively related to
the authors’ research questions. This is a legitimate question for any study that is based
on a single digital media platform, again, regardless of the amount of data drawn from
that platform.
Enhancing data access and ensuring data quality

Several commentators have raised concerns about the fact that much of the Big Data of
greatest interest to social scientists, particularly communication and media scholars,
is the property of commercial entities such as Facebook, Twitter, and Google. These
companies either deny or tightly manage data access by researchers, leading to fears
of “new digital divides” and the creation of classes of researchers who are either “data
rich” or “data poor” (e.g., boyd & Crawford, 2012). These are legitimate fears and
ought to be a source of alarm for everyone in the research community as more and
more of our social life is conducted within commercially owned walled gardens.
But the rhetoric of digital divides fails to capture the full range of the danger.
As communication researchers begin to work with the owners of social networking sites and other proprietary venues, they may well begin to experience the same
challenges that biomedical researchers have experienced working with commercial
entities making drugs and medical devices. Communication researchers may have to
contend with the fact that companies will grant access only to data that they believe
will reflect positively upon their commercial interests. They will discover, as biomedical researchers have, that sponsorship and assistance often come with strings. Sometimes these strings are explicit, as in the case of a company demanding the right to
approve manuscripts before they are submitted for publication. Sometimes, the strings
will be implicit, as in cases where researchers are biased by their own desire to please
or to gain visibility through association with a trendy company or industry group.
In extreme cases, there may be direct conflicts of financial interest when investigators have ownership stakes in, or extensive consulting relationships with, the companies whose
products they study.
Significant challenges therefore face us as we move into the era of Big Data. Some
are new, but fortunately most of them are the same challenges that have been faced
with major methodological innovations in the past. Looking past claims of exceptionalism will help us recognize the road ahead. Moving forward holds the potential
not only for examining existing questions in new ways, but also for positioning the discipline of communication at the heart of efforts to understand social and civic life in
an increasingly mediated age. The challenges are familiar; the theoretic and practical
potential is enormous.
References
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method
obsolete. Wired [WWW document]. Retrieved from
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory.
boyd, d., & Crawford, K. (2012). Critical questions for big data. Information, Communication
& Society, 15, 662–679. doi:10.1080/1369118X.2012.678878.
Brenner, J., & Smith, A. (2013). 72% of online adults are social networking site users.
Washington, DC: Pew Research Center’s Internet & American Life Project. Retrieved
from http://pewinternet.org/~/media//Files/Reports/2013/PIP_Social_networking_sites_update_PDF.pdf.
Christakis, N., & Fowler, J. H. (2007). The spread of obesity in a large social network over 32
years. New England Journal of Medicine, 357, 370–379. doi:10.1056/NEJMsa066082.
Christakis, N., & Fowler, J. H. (2009). Connected: The surprising power of our social networks
and how they shape our lives. New York, NY: Little, Brown.
Kuhn, T. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago
Press.
Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A revolution that will transform how
we live, work, and think. Boston, MA: Houghton Mifflin Harcourt.