Big Data’s Role in Health and Risk Messaging
- Bradford William Hesse, Health Communication & Informatics Research Branch, Behavioral Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute
The presence of large-scale data systems can be felt, consciously or not, in almost every facet of modern life, whether through the simple act of selecting travel options online, purchasing products from online retailers, or navigating through the streets of an unfamiliar neighborhood using global positioning system (GPS) mapping. These systems operate through the momentum of big data, a term introduced by data scientists to describe a data-rich environment enabled by a superconvergence of advanced computer-processing speeds and storage capacities; advanced connectivity between people and devices through the Internet; the ubiquity of smart, mobile devices and wireless sensors; and the creation of accelerated data flows among systems in the global economy. Some researchers have suggested that big data represents the so-called fourth paradigm in science, wherein the first paradigm was marked by the evolution of the experimental method, the second was brought about by the maturation of theory, the third was marked by an evolution of statistical methodology as enabled by computational technology, while the fourth extended the benefits of the first three, but also enabled the application of novel machine-learning approaches to an evidence stream that exists in high volume, high velocity, high variety, and differing levels of veracity.
In public health and medicine, the emergence of big data capabilities has followed naturally from the expansion of data streams from genome sequencing, protein identification, environmental surveillance, and passive patient sensing. In 2001, the National Committee on Vital and Health Statistics published a road map for connecting these evidence streams to each other through a national health information infrastructure. Since then, the road map has spurred national investments in electronic health records (EHRs) and motivated the integration of public surveillance data into analytic platforms for health situational awareness. More recently, the boom in consumer-oriented mobile applications and wireless medical sensing devices has opened up the possibility for mining new data flows directly from altruistic patients. In the broader public communication sphere, the ability to mine the digital traces of conversation on social media presents an opportunity to apply advanced machine learning algorithms as a way of tracking the diffusion of risk communication messages. In addition to utilizing big data for improving the scientific knowledge base in risk communication, there will be a need for health communication scientists and practitioners to work as part of interdisciplinary teams to improve the interfaces to these data for professionals and the public. Too much data, presented in disorganized ways, can lead to what some have referred to as “data smog.” Much work will be needed for understanding how to turn big data into knowledge, and just as important, how to turn data-informed knowledge into action.
The Ubiquity of Data Systems
Consciously or not, individuals interact with complex data systems daily. A person who starts the day by listening to the morning weather report is consuming information processed nearly in real time from data streamed at high velocity from remote meteorological sensing devices. These data are combined to form a manageable input stream, processed against computer models to forecast future weather states, and then broadcast through local news outlets to inform daily decision-making. Tech-savvy consumers might refer to their mobile phones or tablet apps for a real-time check-up on daily (or hourly) projections of weather conditions. Although most users of weather information have come to understand that data-based forecasts are by their very nature probabilistic, they nevertheless depend heavily on these forecasts to make real-life decisions over the course of their day: how to dress for work, what route to take when traveling, and what appointments to cancel if inclement weather threatens. These same people may drive to work guided by high-velocity telemetry data streamed into a global positioning system (GPS) sitting unobtrusively on their car’s dashboard, with real-time communication screens alerting them to upcoming traffic patterns and a synthesized voice providing step-by-step guidance designed to support high-stakes, short-response-time decision-making. After work, they may come home and, in their spare time, go online to make a purchase from an Internet retailer or stream entertainment options from a digitally connected content producer. These activities are likely leaving digital traces of their daily behavior, traces that can be subjected to advanced data-mining algorithms to infer behavior patterns and to support targeted communications from retailers to consumers.
This is the era of big data—a superconvergence of technology and data-analytic capabilities (Friedman, 2016) that is beginning to have a profound effect on the ways in which knowledge is generated and utilized in the 21st century (Shah, Cappella, & Neuman, 2015). There are two sides to the story of how big data may influence the study and practice of risk communication relevant to health. On the one side, a movement toward a data-rich environment opens an entirely new toolbox of analytic techniques that can be used to expand our knowledge of effective risk communication strategies (Riley, 2016). Advanced techniques can be used in an exploratory way to look for correlations between variables across a greater swath of observations than has ever been made available before, and from these new insights, scientists can begin to develop and refine the explanatory power of theories in human behavior. Interventions then can be developed and tested at scale, with cybernetic feedback loops informing successive iterations of the intervention until it grows into a space of optimized efficiency. Communication scientists can even use the expanded capacity of crowdsourcing platforms, such as Amazon Mechanical Turk, to broaden the generalizability of their own formative research and even to test hypotheses using the controlled stimuli of an online experiment presented to thousands of participants in parallel within a short amount of time.
From the other side of the story, the use of large-scale data systems in healthcare and public health is beginning to create a wholly new environment in which risk communication must unfold. Consider an observation by the Institute of Medicine (now the National Academy of Medicine) that in the early 1990s, the average treatment decision in medicine was based on the presentation of only a few evidentiary facts. As molecular medicine continues to evolve and the selection of genomic and proteomic assays to consider for diagnosis and treatment decisions expands, the number of new data points for consideration in the clinical encounter could number into the hundreds. From studies of human cognition, we know that short-term memory in humans tops out at roughly seven chunks of new information at a time, plus or minus two (IOM, 2008). New decision-support tools and clinical designs will be needed to winnow the relevant facts for immediate decision-making down into a more manageable format. Communication scientists will be needed to help guide the decision-engineering process to provide the right information to the right member of the care team (including the patient), at the right time and in the right format, to improve workflow and decision-making. Both sides of this story merit discussion.
Using Big Data to Enhance Health Communication Research
The Evolution of Big Data in Science
All this new dependency on data, and especially on big data, may seem like an abrupt change from business as usual in risk communication, but in fact it represents a natural tipping point in a trend that began decades earlier. Its origins can easily be traced to the early days of computing, when “mainframe” information systems first made it possible to input thousands of individual data points (from tapes or punch cards) into electronic memory, against which a complex set of data-processing commands (also known as programs) could be executed. It was this early capacity in large-scale data processing that revolutionized business and expanded the horizons of science. The social sciences naturally began to adapt and expand their capacity to perform multivariate statistical analysis to accomplish in minutes what would have taken weeks to calculate by hand. Then, with the introduction of the silicon-based semiconductor, this capacity to process large amounts of data in increasingly shorter amounts of time began to expand. Intel cofounder Gordon Moore famously predicted that with all the advances that he was observing in the semiconductor industry, computing power would continue to grow at an exponential rate for the foreseeable future, essentially doubling every two years. This audacious prediction, known colloquially as “Moore’s Law,” turned out to be surprisingly prescient. By the 1980s, much of the power of mainframes had migrated to scientists’ desktops through personal computers.
While the 1980s saw the rise of the personal computer, the 1990s and early 2000s saw the emergence of mass connectivity through diffusion of the Internet and its user-friendly architecture for collaboration, which became known as the World Wide Web. In this newest phase of the data revolution, technology publisher Tim O’Reilly predicted that data would become the new “Intel Inside” for the next generation of Internet technologies (O’Reilly, 2005). These new technologies would effectively turn the Internet into a large platform for collective participation, he argued, enabling a new era of “open science.” In 2007, Steve Jobs introduced the iPhone, which brought the smartphone to a mass global market. Expanding on Moore’s Law even further, smartphones funneled the computing power of the old computational mainframes into a device that could sit in a scientist’s pocket as it opened new channels of data flow into remote servers through what would come to be known as the “cloud” (i.e., placing data stores and processing speed into collectively accessible remote servers rather than depending solely on local storage or processing).
Data flows would burgeon further with the introduction of tablet computers, wearable sensors, smart devices, and now—in the age of the “Internet of Things”—smart objects. These converging data flows allow a new paradigm in analytic techniques, allowing high-tech companies such as Google, IBM, Microsoft, and Amazon to improve the intelligence of their respective business systems in real time by harvesting the flow of user data to improve the performance of their search algorithms and personal digital assistants.
The emerging abundance of data streams—catalyzed by Moore’s Law and widened exponentially by the parallel storage capacity of millions of data-recording devices—led Microsoft data architect Jim Gray to conclude that science was entering a “fourth paradigm” in its approach to discovery and knowledge generation (Hey, Tansley, & Tolle, 2009). The first paradigm, he argued, was systematic adoption of the experimental method, which allowed focused, disciplined investigation based on a priori logic of inferred causality. The second was the evolution of theoretical science, emphasizing coherent frameworks for the interconnections among empirical observations. The third was the introduction of computing, which expanded the reach of empirical observation and enabled the application of advanced statistical techniques through the efficiencies afforded by computational systems. The fourth paradigm was the introduction of connected data streams into a data-rich environment, essentially developing science from its historical roots as an individual endeavor into a contemporary approach enabled by cutting-edge platforms for mass participation.
It has been within this context that the popularized notion of “big data” came about. Like any term du jour, the concept has often been misunderstood and subject to hyperbole (Parks, 2014). Engineers at IBM characterized the concept in terms of four dimensions, something they referred to as the “four V’s” of big data. The first was “volume,” or simply the unprecedented amount of data stored in digital format. By some estimates, more than 2.5 exabytes (2.5 × 10^18 bytes) of digital data are added to the global store daily. The second was “variety,” with data inputs available for analysis from software logs, remote sensing devices, cameras, radio-frequency identification, satellite telemetry, microphones, and health informatics infrastructures. A third dimension was “velocity,” with a recognition that broad streams of data can be made available for analysis in real time. The fourth and final dimension was “veracity,” as data scientists struggle with the uncertainty and value of information computed from imprecise sources. These four aspects aptly characterize the tenure of data scientists in public health and medicine who have begun to exploit the power of a data-rich environment to accelerate success in reducing the impact of disease and trauma in the information age.
Big Data in Public Health and Medicine
When the archetypal epidemiologist John Snow gathered systematic observations on outbreaks of cholera in the Soho district of London in 1854, he was setting into place the beginnings of modern public health surveillance systems (Johnson, 2006). During the 20th century, governments utilized local and national surveillance data sources to inform policy, to stay ahead of disease outbreaks, to identify population trends, to explore hypotheses for research, and to monitor businesses for regulatory adherence to public health safety standards. Some of the data from these sources were extracted directly from public records and compiled through census techniques to gain a better understanding of the incidence and prevalence of disease. These archival sources were soon complemented by data extracted from national probability samples to gain a more comprehensive understanding of how physical environment, social context, lifestyle choices, and behavioral predilections might influence the onset, course, and sequelae of disease (Hesse et al., 2013). Initially, these surveillance programs tended to be run as sequestered efforts, collecting and processing data for the primary benefit of a single agency or sovereignty. Efforts have been underway internationally through the World Health Organization (WHO) to harmonize and share data collected through these systems to enable a more coordinated system of global health tracking. With troves of public health data being collected within the United States alone, suggestions have been made to utilize the power of big data analytics to expand the breadth of analyses across complementary datasets for the purposes of improving health situational awareness among all stakeholders (Hesse, 2011; Thacker, Qualters, & Lee, 2012).
In 2001, the U.S.-based National Committee on Vital and Health Statistics took stock of the data flows connecting decision-makers in public, clinical, and personal health. From its evaluations, the committee concluded that work would be needed to turn these traditionally paper-based data systems into machine-readable, electronic streams that could be integrated for the public good. It outlined an ambitious blueprint for creating an electronic, national health information infrastructure in the United States (National Committee on Vital and Health Statistics, 2001). In a follow-up to the report, U.S. president George W. Bush announced a national goal in his 2004 State of the Union address to Congress and the public to connect the majority of Americans to electronic health records (EHRs) within a decade. In support of that goal, he appointed a National Coordinator for Health Information Technology to oversee the effort.
By 2009, some progress had been made toward achieving that goal, but because of lack of market incentive, adoption was slow. To boost adoption, Congress passed the 2009 Health Information Technology for Economic and Clinical Health (HITECH) Act, featuring provisions for offering financial incentives to hospitals and individual medical practices that demonstrated the meaningful use of Health Information Technology (Health IT) for the improvement of patient care. Adoption rates began to rise precipitously. By 2015, data suggested that 96% of nonfederal acute care hospitals had adopted a certified EHR (Henry, Pylypchuk, Searcy, & Patel, 2016).
Although adoption rates of data-driven health information technologies had risen by 2015, the ability to pool data across healthcare systems remained obstructed. In testimony before Congress, professional medical societies pointed to data blocking (the practice of limiting the export of data to shield proprietary value) as the main culprit. In December 2016, Congress passed the 21st Century Cures Act, which prohibited data blocking and authorized penalties of up to $1 million per violation. On the technological front, the Office of the National Coordinator pointed to the value of application programming interfaces (APIs) to enable the flow of data between legacy EHR systems and fit-for-purpose applications to support clinical care and patient self-management. Investments in demonstration projects illustrated how data tools and infrastructures could be built on top of the existing EHR systems to catalyze a thriving ecosystem of consumer-focused apps, much in the same way as mobile smartphone and tablet operating systems allowed development within their respective app stores (Mandl & Kohane, 2016).
One of the ultimate targets of this interoperability development is to improve data liquidity among practice components, and by so doing, to enable the creation of a learning healthcare system (Olsen, Aisner, McGinnis, & Institute of Medicine Roundtable on Evidence-Based Medicine, 2007). The vision of a learning healthcare system is to create an environment in healthcare, fueled by data, that could (a) be leveraged for quality improvement, (b) be utilized for comparative effectiveness research, (c) inform real-time decision-making, and (d) serve as the springboard for new biological and behavioral discovery. Inputs would come from structured fields within EHRs, unstructured fields analyzed through natural-language-processing algorithms, imaging data, laboratory data, genomic-sequencing data, sensor data, and even data generated by patients equipped with mobile accelerometers, blood glucose sensors, cardiologic sensors, and the like. The resulting flows created by such interoperable systems would naturally be of high volume, high velocity, and high variety, and—to serve medical needs—would have to be evaluated for high veracity as well. In 2013, the director of the U.S. National Institutes of Health (NIH) appointed the agency’s first associate director for data science to supervise the research needed to harmonize these data streams and extract scientific value through advanced analytic techniques. One of the first initiatives funded by the associate director’s office was the “Big Data to Knowledge” program, designed to create centers of excellence in data science.
A Big Data Toolbox for Risk Communication Scientists
With these developments as prelude, communication scientists now have an expanded toolbox with which to exploit the value of big data for new knowledge in health and risk communication. Hindman (2015), for example, noted that for years, the analytic paradigm in the social sciences has been dominated by ordinary least squares (OLS) regression techniques. OLS approaches, he points out, are subject to weaknesses and limitations that can be overcome with some of the newer machine-learning approaches being tested within the field of computer science. To begin with, standard OLS techniques are poor compared to other techniques when it comes to predicting values beyond the sample from which the regression coefficients were generated. Because the models tend to be poor predictors, they also tend to be less replicable than other techniques. Many of the machine-learning approaches coming out of the big data revolution call for segmenting a robust, large-scale dataset into two sets. One set, called the training set, is used to build the predictive model. The other set, called the test set, is used to evaluate the performance and replicability of the predictive model. Computer scientists then use an ensemble of complementary modeling techniques on the training set to ferret out which model, or combination of models, is best at predicting quantitative relationships reliably and efficiently in the test set. The overall goal is to improve the explanatory power of the resulting statistical model.
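The training-set and test-set workflow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than anything drawn from Hindman (2015): the data are synthetic, the 70/30 split is arbitrary, and the two candidate models (a mean-only baseline and a one-predictor least-squares line) merely stand in for the richer ensemble of techniques a data scientist would actually compare.

```python
import random


def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_mean_only(xs, ys):
    """Baseline candidate: always predict the training mean (slope of zero)."""
    return sum(ys) / len(ys), 0.0


def mean_squared_error(model, xs, ys):
    """Average squared prediction error of a (intercept, slope) model."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (hypothetical): y = 2 + 3x plus Gaussian noise.
rng = random.Random(42)
rows = []
for _ in range(1000):
    x = rng.uniform(0, 10)
    rows.append((x, 2.0 + 3.0 * x + rng.gauss(0, 1)))

# Shuffle, then hold out 30% of the rows as the test set.
rng.shuffle(rows)
test_rows, train_rows = rows[:300], rows[300:]
train_x = [x for x, _ in train_rows]
train_y = [y for _, y in train_rows]
test_x = [x for x, _ in test_rows]
test_y = [y for _, y in test_rows]

# Build every candidate on the training set only.
candidates = {
    "mean_only": fit_mean_only(train_x, train_y),
    "ols_line": fit_ols(train_x, train_y),
}

# Judge every candidate on data it has never seen.
test_errors = {
    name: mean_squared_error(model, test_x, test_y)
    for name, model in candidates.items()
}
best_model = min(test_errors, key=test_errors.get)
print(test_errors, best_model)
```

The point of the design is that model selection rests on out-of-sample error rather than on fit to the data used for estimation, which is what ties the approach to replicability.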
Detractors of these machine-learning approaches often argue that data-mining techniques constitute an atheoretical, computational exercise that needs little in the way of deep subject matter understanding of the data being modeled. Nothing could be further from the truth, observed Hindman (2015). Contemporary data-mining techniques require as much of an understanding of the underlying data for interpretation of the resulting models as do linear regression or logit techniques. Moreover, because big data methodologies necessarily value parsimony as a cardinal principle, the resulting models can be integrated back into a conceptual understanding of the underlying phenomenon in straightforward ways—something that stands in contrast to the overly complex OLS models built from a kitchen-sink approach to variable inclusion. On the other hand, it remains true that for big data approaches—as in all empirical approaches—the same principles of rigorous scientific understanding must apply. The following caveats are worthwhile to bear in mind:
Correlation does not imply causation. One big data technique cultivated through genome-wide association studies (GWAS) allows researchers to search for cross-study correlations between single nucleotide polymorphisms (SNPs) and the occurrence of disease. This technique produces skyline charts, in which frequently implicated SNPs accumulate stacks of repeatedly found correlations that rise higher than those of SNPs with no correlative relationship to a disease. When portrayed graphically, these differentially accrued stacks of correlation resemble a skyline, with a mixture of low-rise and high-rise stacks. The charts are very powerful for suggesting areas of fruitful investigation; nevertheless, their use should be exploratory in nature. Although machine-learning techniques can excel at revealing correlations of varied complexity at multiple levels of modeling, they are insufficient for producing the replicable, conceptual understanding that will be needed to generalize beyond the tested datasets. Big data, argued Coveney, Dougherty, and Highfield (2016), will need Big Theory too. Further experimental work needs to be conducted before inferences on causation are drawn and the necessary conceptual advances are made to the scientific knowledge base. It still will take conceptual rigor and faithful adherence to the scientific process to turn big data into knowledge in its fullest sense.
Garbage in still leads to garbage out. Much has been written about the power of big data analytics to aid in the evaluation of political races (Silver, 2012). Those successes seemed to fade in comparison to the repeated occurrence of miscalculation around the globe in 2016. All the reasons offered for why these large-scale analytic efforts faltered revolved around the quality of data being fed into the statistical models. When it comes to polling, surveillance research, and biomedical observation, the results will continue to falter if the data going into the models are inherently biased or noisy.
P-hacking and publication bias. The cudgel of “dustbowl empiricism” has justifiably been leveled against researchers who deliberately inflate their sample sizes or rework their analyses in a post hoc fashion until they have crossed an arbitrary threshold (e.g., p < .05) for publication purposes. The practice, referred to as “p-hacking,” is a particular risk within data-rich environments, where the sheer number of variables available for testing multiplies the opportunities for a spurious result (Simonsohn, Nelson, & Simmons, 2014).
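A short worked calculation suggests why repeated testing is hazardous when many variables are on hand. The sketch below assumes, purely for illustration, that every test is independent and that every null hypothesis is true; under those assumptions, the chance of at least one "significant" result at a threshold of alpha across k tests is 1 - (1 - alpha)^k. The function name is hypothetical.

```python
def familywise_false_positive_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests.

    Assumes every null hypothesis is true and the tests are independent,
    which real analyses of correlated variables will only approximate.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# With alpha = .05, the risk climbs quickly as more comparisons are tried.
for k in (1, 5, 20, 100):
    print(k, round(familywise_false_positive_rate(k), 3))
```

Under these assumptions, an analyst who tries 20 unrelated comparisons has roughly a 64% chance of finding at least one that clears p < .05 by chance alone.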
Open-science and data-sharing advocates have reasoned that under the right circumstances, big data approaches could help address some of these issues. Preregistration of hypotheses and variables can serve to limit a posteriori fishing, while data sharing can inform meta-analyses and replicability studies (Ioannidis, Munafo, Fusar-Poli, Nosek, & David, 2014; Nosek et al., 2015). Machine-learning techniques, as described earlier, emphasize replicability as a critical dimension of analysis (Hindman, 2015).
Big Data and the Public Web
Another area in which big data have come into play relevant to risk communication is through the open environment of the public Web. To be sure, marketing companies have been utilizing big data collected from search logs, clickstream data, persistent cookies, and automated Web crawlers to anticipate the needs and predict the behavior of their customers for years. In 2008, the online search giant Google made headlines when it demonstrated how an analysis of spikes in its search data for flulike symptoms could serve as an early warning system for localized flu outbreaks. Although the algorithm for predicting flu outbreaks underperformed in comparison to official statistics released by the Centers for Disease Control and Prevention (CDC) in subsequent years, many public health scientists have continued to explore improvements to the underlying statistical models. A promising area of improvement has been through the use of Autoregressive Integrated Moving Average (ARIMA) approaches to control for temporal dependencies. Public health scientists have successfully utilized big data approaches to improve predictions of outbreaks in influenza, Ebola, dengue, and Zika (Yang, Santillana, & Kou, 2015).
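To give a sense of what controlling for temporal dependencies involves, the following sketch fits the simplest member of the ARIMA family, a first-order autoregressive model (ARIMA(1,0,0) with no intercept), by least squares. It is not the model used in the cited flu-tracking work. The series is simulated, the coefficient of 0.8 is an arbitrary assumption, and the function names are hypothetical.

```python
import random


def simulate_ar1(n, phi, rng, noise_sd=1.0):
    """Simulate y_t = phi * y_{t-1} + e_t, starting from zero."""
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0, noise_sd))
    return series


def fit_ar1(series):
    """Least-squares estimate of phi in y_t = phi * y_{t-1} + e_t."""
    previous = series[:-1]
    current = series[1:]
    numerator = sum(p * c for p, c in zip(previous, current))
    denominator = sum(p * p for p in previous)
    return numerator / denominator


rng = random.Random(0)
# Hypothetical stand-in for mean-centered weekly symptom-search volume.
deviations = simulate_ar1(2000, 0.8, rng)

phi_hat = fit_ar1(deviations)
# One-step-ahead forecast: last observation scaled by the estimated carryover.
next_week_forecast = phi_hat * deviations[-1]
print(round(phi_hat, 3), round(next_week_forecast, 3))
```

Full ARIMA models add differencing and moving-average terms, and in practice would be fitted with an established statistical package rather than by hand.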
Big data analytics also have been applied to the task of mining public Web resources about noninfectious diseases. Tourassi and her colleagues at Oak Ridge National Laboratory, for example, utilized Web-mining software and advanced natural-language-processing methods applied to online obituaries to replicate incidence and mortality data from the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) program (Tourassi, Yoon, Xu, & Han, 2016). Soliman, Nasraoui, and Cooper (2016) used an advanced text-mining approach to extract information about gene interactions related to glaucoma from the National Library of Medicine’s online PubMed database, and then used those data to benchmark an interaction network associated with the disease. Yoon and Bakken (2012) demonstrated how Web-mining techniques could be applied to the microblogging context of tweets to uncover mental models related to physical activity, and Chan, Lopez, and Sarkar (2015) investigated the nonmedical use of opioids using similar efforts. Cappella, Yang, and Lee (2015) modeled a recommendation system for tailoring messages to smokers based on the high-density data algorithms used by commercial content providers such as Amazon and Netflix.
The public Web has also served as a platform for mass participation in knowledge creation and data generation. The public site PatientsLikeMe is one such platform, where patients with a specific diagnosed disease can go online to compare the progress of their treatments with those of patients who are similarly diagnosed. With more than 300,000 registered users in 2015, the site encourages its members to share personal data over the course of their treatments so that biomedical researchers and pharmaceutical manufacturers can collect the data that they need to uncover side effects and develop new hypotheses on targets for treatment. In a similar way, the U.S. Food and Drug Administration (FDA) opened its Sentinel database to record adverse events from patients through social media feeds, and by so doing, was able to expand its coverage of reported side effects (Ball, Robb, Anderson, & Dal Pan, 2016). Each of these examples represents a type of crowdsourcing, a term coined to describe the multiplying effect of asking large groups of people to contribute solutions to a problem through parallel inputs on the Web. The Amazon Mechanical Turk platform, which allows individuals to perform online tasks for negotiated amounts of money, has become a mainstay of crowdsourcing for many academic researchers.
To understand more about how large groups of individuals can help accelerate science by contributing data in parallel, the National Science Foundation has invested heavily in citizen science projects. In one such project, laypersons with asthma volunteered to use specially equipped smartphones to monitor air quality in their local environment. Geopositioning data affixed to the transmitted data streams allowed researchers to locate readings in real time on a virtual map. The volunteers benefited immediately by learning more about the environmental quality of locations in which they might find themselves, while the donated data streams could be analyzed through big data analytics for geospatial and temporal trends.
The Robert Wood Johnson Foundation also has invested in understanding how crowdsourcing principles can be brought to bear on accelerating discovery in the context of health. The foundation has coined the phrase “data for good” to describe how individuals imbued with a sense of altruism can contribute data from personal monitoring devices with the purpose of furthering research on health promotion. Data altruism lies at the core of many of the large biomedical research initiatives being funded by the NIH. One of these initiatives, termed the “All of Us Research Program,” will solicit data contributions from patients over their lifespans using patient portals, EHRs, and mobile monitoring devices. The thrust will be to collect the data necessary to calibrate treatments more precisely to patients’ genomic and phenotypic profiles. In this sense, the program is intended to be the platform upon which NIH will pursue its goals in the much-publicized area of precision medicine (NIH, 2016).
Implications of Big Data for Communicators
The idea of collecting mounds of raw data and then dumping them in front of decision-makers and the public without a means of interpretation is anathema to good health communication principles. Yet this is a perennial problem in the age of electronically connected data streams. Journalist David Shenk coined the term data smog to describe the situation occurring at the dawn of the digital age, in which information seekers on the Web were exposed to disconnected islands of medical and health data presented without the background necessary to synthesize or interpret them (Shenk, 1997). Indeed, early reviews of public-facing health information sites revealed that much of the information posted online required an advanced college education to interpret and understand; moreover, little effort was made to translate the information into languages other than English (Berland et al., 2001). Data from the Health Information National Trends Survey, a general population survey conducted by the NIH, have suggested a trend of increasing confusion regarding the basic principles behind health promotion and disease prevention as the public became exposed to the cacophony of conflicting studies and health messages online (Hesse, Greenberg, & Rutten, 2016).
The answer to the problem of data smog is to be proactive in creating the methods, training, and formats needed to turn numeric information into actionable knowledge within a health context. This will require an interdisciplinary approach to separate signal from noise at every point at which data must be translated from their raw state into usable information that can inform patients, doctors, the general public, and policymakers (Nelson, Hesse, & Croyle, 2009). Some of the initial responsibility will lie with the computer scientists and engineers who create the applications needed to process and then present data to users. The accomplishment of modern information tools such as the common GPS navigation system lies in the compelling way they translate gigabytes of real-time data into an easily comprehensible display. This was not an easy task; it took years of human factors and human-computer interaction research to perfect. The same must become true for health-related information systems. One of the biggest complaints leveled by the American Medical Association against the current spate of EHR systems is that the systems are not user-friendly. Efforts are currently underway to reexamine the interfaces accompanying these systems to ensure that each component operates in concert with the others to support comprehension, facilitate decision-making, and clarify action.
The opportunity, and necessity, for risk communication scientists to participate in this retooling of data systems in medicine is tremendous. Risk communication scientists bring with them a firm theoretical understanding of how data can inform decision-making in reliable ways if formatted and presented correctly, while being keenly aware of how the limitations of human cognition can lead to confusion or errant behavior if data are presented poorly. The following are just some of the opportunities for interdisciplinary research at the juncture of risk communication and big data:
Formatting data displays to support high-performance decision-making. Decision scientists have long noted that medical judgment can be prone to bias and error. Even well-trained hospital personnel can routinely make mistakes when drawing conclusions from medical data. In one study, for example, obstetricians routinely concluded that the probability of a patient having breast cancer, given a positive result on a mammogram with known rates of sensitivity (true positives) and specificity (true negatives), was around 80%; the true answer was closer to 8% (Gigerenzer & Edwards, 2003). The error stemmed largely, the authors reasoned, from the artificial and counterintuitive nature of Bayesian statistics when expressed as conditional probabilities. Just by presenting the data differently, in this case in the form of natural frequencies, the investigators were able to keep clinicians from drawing the erroneous conclusion. Another effective method was to augment users’ understanding of data by improving the format in which they are presented to patients and caregivers (Ancker, Chan, & Kukafka, 2009; Ancker, Senathirajah, Kukafka, & Starren, 2006; Fagerlin, Wang, & Ubel, 2005; Fagerlin, Zikmund-Fisher, & Ubel, 2011; Zikmund-Fisher, Fagerlin, & Ubel, 2008). As medicine pivots toward data-rich decision environments, the need will grow for communication scientists to help in the design of decision aids, digital displays, and electronic dashboards with the goal of improving decisional efficacy in high-stakes environments (Shneiderman, Plaisant, & Hesse, 2013).
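The gap between the intuitive answer and the correct one in the mammogram example can be made concrete with a short calculation. The Python sketch below is only an illustration: the prevalence, sensitivity, and false-positive rate are assumed values chosen to produce a result near 8%, not figures taken from the cited study, and the function names are hypothetical. It computes the same quantity twice, once with Bayes' rule on probabilities and once as natural frequencies in a group of 1,000 women.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability of disease given a positive test, computed with Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def natural_frequencies(prevalence, sensitivity, false_positive_rate, population=1000):
    """Restate the same problem as whole-person counts in a reference population."""
    sick = round(population * prevalence)
    true_pos = round(sick * sensitivity)
    false_pos = round((population - sick) * false_positive_rate)
    return sick, true_pos, false_pos


# Illustrative assumptions only (not the values reported in the cited study).
PREVALENCE = 0.01            # 1% of screened women have breast cancer
SENSITIVITY = 0.80           # 80% of women with cancer test positive
FALSE_POSITIVE_RATE = 0.096  # 1 - specificity

ppv = positive_predictive_value(PREVALENCE, SENSITIVITY, FALSE_POSITIVE_RATE)
sick, tp, fp = natural_frequencies(PREVALENCE, SENSITIVITY, FALSE_POSITIVE_RATE)

print(f"Conditional-probability format: P(cancer | positive) = {ppv:.1%}")
print(f"Natural-frequency format: of 1000 women, {sick} have cancer and {tp} of them "
      f"test positive; {fp} of the remaining {1000 - sick} also test positive.")
print(f"So {tp} of the {tp + fp} women with a positive result actually have cancer.")
```

Under these assumed values, both formats give a little under 8%. The natural-frequency version reaches that answer by counting people (8 of 103 positive results) rather than by inverting conditional probabilities, which is the reformatting the paragraph above describes.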
Improving the credibility and usability of data on the public Web. Outside the clinical environment, Web system engineers have been working with communication scientists to improve the user-friendliness of their online data support tools. Several useful resources have been prepared to assist Web designers and communicators in making user-friendly sites for data access. One of the more comprehensive style guides, based on levels of supporting evidence, can be found at the U.S. Department of Health and Human Services’ usability.gov website. Another scientifically vetted style guide for communicating health data can be found at the “Visualizing Health” website (www.vizhealth.org), presented jointly by the University of Michigan and the Robert Wood Johnson Foundation. In 2009, the U.S. government made a concerted effort to make the data that it collected on behalf of taxpayers available to users and entrepreneurs. A collection of publicly available data resources can be found at data.gov. Some of these resources can be viewed through the lens of an already configured website. Others can be accessed as raw data streams that can be integrated through APIs into mobile apps or Web services. Meteorological data from the National Oceanic and Atmospheric Administration are routinely streamed to mobile weather apps on smartphones, websites, and wearable technologies. Health data can be found among the catalogued resources.
Providing support to communities. Revisiting their blueprint for a national information infrastructure, the National Committee on Vital and Health Statistics has emphasized the role of local communities in gathering and presenting the data needed to address health disparities and to promote population health locally. A good example of this approach can be found in the groundbreaking work of communication scientists collaborating as liaisons between the North Carolina medical systems and patient advocacy groups. Utilizing data from the system’s health information exchanges, data scientists could pinpoint areas of vulnerability in which services had not been provided in equal measure across all populations. One analysis, for example, alerted hospital administrators that African Americans living in a certain part of the catchment area were not taking full advantage of a surgical procedure considered to be the standard of care. Communication specialists were sent to visit these potential surgical candidates directly and to provide the kind of sensitive and nuanced consultation needed to ensure that all members of the healthcare system were given equal access to medical services. In this way, hospital administrators utilized analyses of big data to address points of vulnerability within their jurisdictions (Oh et al., 2016). Accountability for population health outcomes in the local community was one of the principal objectives of the HITECH Act of 2009, as well as the Medicare Access and CHIP Reauthorization Act of 2015 (MACRA).
Using big data to manage high-risk and high-cost patients. In the U.S. health system, roughly 5% of all patients account for 50% of the costs. There have been many reasons offered for why this might be the case. One often-cited explanation is a historical emphasis on fee-for-service reimbursement within the health insurance industry that inadvertently may have favored late-stage treatments over prevention or continuity of care. Because of this lack of emphasis on prevention, more patients may be presenting with multiple chronic conditions later in life. Those patients will account for the lion’s share of costs in their respective healthcare systems. Moreover, disconnects in a fragmented care system can mean that physicians are routinely ordering redundant tests and might be doubling up on expensive treatments for a small group of patients. Now that EHR systems are becoming more commonplace and big data approaches are being enabled in healthcare, it should be possible to predict trends ahead of time through big data analytics. Communication practitioners can work with these trends in helping patients anticipate the likely risks associated with certain high-risk profiles. Predictive analytics can be used to protect high-risk populations from falling prey to extremely debilitating disease conditions and financially toxic treatment expectations (Bates, Saria, Ohno-Machado, Shah, & Escobar, 2014).
Using big data to personalize healthcare. As the diffusion of smartphones, wearable technologies, and the Internet of Things expands, consumers are gaining access to personalized streams of data at unparalleled rates. For risk communication scientists, there are many levels of opportunity with this latest diffusion of innovations. First, there is a need to understand how exactly to use these patient-facing technologies to support healthy behaviors and personal decision-making. For example, “just-in-time adaptive interventions” are a type of communication intervention that uses streaming data from smartphones or wearable computing devices to support healthy decisions in real time. Understanding the nature, dose, and frequency of the messages that make up these interventions is rapidly becoming an important area of risk communication research (Stone, 2007). Second, changes in EHRs and medical practice are making it possible for patients to share some of these personal data with selected members of their care team. Understanding how to turn those data streams into a comprehensible vehicle for patient-provider communication and situational awareness will be another area of needed research (Dimitrov, 2016). Third, data collected from millions of individual data donors can begin to narrow treatment recommendations in more precise ways. This is the basis of “precision medicine”: to use a burgeoning database of molecular markers, genomically derived tendencies, and contextual influences to tailor treatment to the circumstances of each individual patient. Research will be needed to turn the complexity of these data-enabled treatment regimens into easy-to-follow prescriptions for patients and their care teams (Chawla & Davis, 2013).
A Final Word
As we have seen, big data approaches to health and medicine are becoming ubiquitous in health research and practice. As in any aspect of science, taking advantage of these new capabilities will require meticulous attention to detail, rigor, and a disciplined resistance to areas of unwarranted hyperbole. To be sure, risks and pitfalls abound. Nontransparent algorithms buried within the equities trading sector were blamed for precipitating the global recession of 2008–2009. Reports of hospital-targeted ransomware, the use of data for profit without consent, and many other highly publicized examples of malfeasance may have a chilling effect on patients’ trust in donating their personal data for research. Still, the benefits associated with this fourth paradigm in science will ultimately outweigh the costs. The key will be to enlist the full participation of a multidisciplinary scientific base, including the active involvement of communication scientists, in turning big data into knowledge and knowledge into action.
References
- Ancker, J. S., Barron, Y., Rockoff, M. L., Hauser, D., Pichardo, M., Szerencsy, A., & Calman, N. (2011). Use of an electronic patient portal among disadvantaged populations. Journal of General Internal Medicine, 26(10), 1117–1123.
- Ancker, J. S., Chan, C., & Kukafka, R. (2009). Interactive graphics for expressing health risks: Development and qualitative evaluation. Journal of Health Communication, 14(5), 461–475.
- Ancker, J. S., Senathirajah, Y., Kukafka, R., & Starren, J. B. (2006). Design features of graphs in health risk communication: A systematic review. Journal of the American Medical Informatics Association, 13(6), 608–618.
- Ball, R., Robb, M., Anderson, S. A., & Dal Pan, G. (2016). The FDA’s sentinel initiative—a comprehensive approach to medical product surveillance. Clinical Pharmacology & Therapeutics, 99(3), 265–268.
- Bates, D. W., Saria, S., Ohno-Machado, L., Shah, A., & Escobar, G. (2014). Big data in health care: Using analytics to identify and manage high-risk and high-cost patients. Health Affairs, 33(7), 1123–1131.
- Berland, G. K., Elliott, M. N., Morales, L. S., Algazy, J. I., Kravitz, R. L., Broder, M. S., . . ., McGlynn, E. A. (2001). Health information on the Internet: Accessibility, quality, and readability in English and Spanish. JAMA, 285(20), 2612–2621.
- Cappella, J. N., Yang, S., & Lee, S. (2015, May). Constructing recommendation systems for effective health messages using content, collaborative, and hybrid algorithms. Annals of the American Academy of Political and Social Science, 659, 290–306.
- Chan, B., Lopez, A., & Sarkar, U. (2015). The canary in the coal mine tweets: Social media reveals public perceptions of non-medical use of opioids. PLoS ONE, 10(8), e0135072.
- Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: A patient-centered framework. Journal of General Internal Medicine, 28, S660–S665.
- Coveney, P. V., Dougherty, E. R., & Highfield, R. R. (2016). Big data need big theory too. Philosophical Transactions of the Royal Society A—Mathematical Physical and Engineering Sciences, 374(2080).
- Dimitrov, D. V. (2016). Medical Internet of Things and big data in healthcare. Healthcare Informatics Research, 22(3), 156–163.
- Fagerlin, A., Wang, C., & Ubel, P. A. (2005). Reducing the influence of anecdotal reasoning on people’s health care decisions: Is a picture worth a thousand statistics? Medical Decision Making, 25(4), 398–405.
- Fagerlin, A., Zikmund-Fisher, B. J., & Ubel, P. A. (2011). Helping patients decide: Ten steps to better risk communication. Journal of the National Cancer Institute, 103(19), 1436–1443.
- Friedman, T. L. (2016). Thank you for being late: An optimist’s guide to thriving in the age of accelerations. New York: Farrar, Straus, and Giroux.
- Gigerenzer, G., & Edwards, A. (2003). Simple tools for understanding risks: From innumeracy to insight. BMJ, 327(7417), 741–744.
- Henry, J., Pylypchuk, Y., Searcy, T., & Patel, V. (2016). Adoption of electronic health record systems among U.S. non-federal acute care hospitals: 2008–2015. Washington, DC: The Office of the National Coordinator for Health Information Technology.
- Hesse, B. W. (2011). Public health surveillance in the context of growing sources of health data: A commentary. American Journal of Preventive Medicine, 41(6), 648–649.
- Hesse, B. W., Greenberg, A. J., & Rutten, L. J. F. (2016). The role of Internet resources in clinical oncology: Promises and challenges. Nature Reviews Clinical Oncology, 13, 767–776.
- Hesse, B. W., Nelson, D. E., Moser, R. P., Blake, K. D., Chou, W.-Y. S., Finney Rutten, L. J., & Beckjord, E. B. (2013). National health communication surveillance systems. In D. K. Kim, A. Singhal, & G. L. Kreps (Eds.), Global health communication strategies in the 21st century: Design, implementation, and evaluation (pp. 317–334). New York: Peter Lang.
- Hey, T., Tansley, S., & Tolle, K. (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.
- Hindman, M. (2015, April). Building better models: Prediction, replication, and machine learning in the social sciences. Annals of the American Academy of Political and Social Science, 659, 48–62.
- Institute of Medicine (IOM). (2008). Evidence-based medicine and the changing nature of health care: 2007 IOM annual meeting summary. Washington, DC: National Academies Press.
- Ioannidis, J. P., Munafo, M. R., Fusar-Poli, P., Nosek, B. A., & David, S. P. (2014). Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences, 18(5), 235–241.
- Johnson, S. (2006). The ghost map: The story of London’s most terrifying epidemic—and how it changed science, cities, and the modern world. New York: Riverhead Books.
- Mandl, K. D., & Kohane, I. S. (2016). Time for a patient-driven health information economy? New England Journal of Medicine, 374(3), 205–208.
- National Committee on Vital and Health Statistics. (2001). Information for health: A strategy for building the National Health Information Infrastructure. Washington, DC: Department of Health and Human Services.
- National Institutes of Health (NIH). (2016). PMI Cohort Program announces new name: The All of Us Research Program. Retrieved from https://www.nih.gov/allofus-research-program/pmi-cohort-program-announces-new-name-all-us-research-program.
- Nelson, D. E., Hesse, B. W., & Croyle, R. T. (2009). Making data talk: Communicating health data to the public, policy, and the press. New York: Oxford University Press.
- Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., . . ., Yarkoni, T. (2015). Scientific standards: Promoting an open research culture. Science, 348(6242), 1422–1425.
- O’Reilly, T. (2005). What is Web 2.0? Design patterns and business models for the next generation of software. Retrieved from http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html.
- Oh, A., Chou, W.-Y. S., Jackson, D., Cykert, S., Jones, N., Schaal, J., . . ., Community, R. X. (2016). Reducing cancer disparities through community engagement: The promise of informatics. In B. W. Hesse, D. K. Ahern, & E. Beckjord (Eds.), Oncology informatics: Using health information technology to improve processes and outcomes in cancer (pp. 23–39). Boston: Elsevier.
- Olsen, L., Aisner, D., McGinnis, J. M., & Institute of Medicine Roundtable on Evidence-Based Medicine. (2007). The Learning Healthcare System: Workshop summary. Washington, DC: National Academies Press.
- Park, T. (2011). Information “liberacion.” Interviewed by Mark Hagland. Healthcare Informatics, 28(12), 45–46.
- Parks, M. R. (2014). Big data in communication research: Its contents and discontents. Journal of Communication, 64, 355–360.
- Riley, W. T. (2016). A new era of clinical research methods in a data-rich environment. In B. W. Hesse, D. K. Ahern, & E. Beckjord (Eds.), Oncology informatics: Using health information technology to improve processes and outcomes in cancer (pp. 343–357). Boston: Elsevier.
- Shah, D. V., Cappella, J. N., & Neuman, W. R. (2015). Special issue: Toward computational social science: Big data in digital environments. Annals of the American Academy of Political and Social Science, 659(May), 1–318.
- Shenk, D. (1997). Data smog: Surviving the information glut. San Francisco: Harper Edge.
- Shneiderman, B., Plaisant, C., & Hesse, B. W. (2013). Improving healthcare with interactive visualization. IEEE Computer, 46(5), 58–66.
- Silver, N. (2012). The signal and the noise: Why so many predictions fail—but some don’t. New York: Penguin Press.
- Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547.
- Soliman, M., Nasraoui, O., & Cooper, N. G. (2016). Building a glaucoma interaction network using a text mining approach. BioData Mining, 9, 17.
- Stone, A. A. (2007). The science of real-time data capture: Self-reports in health research. Oxford and New York: Oxford University Press.
- Thacker, S. B., Qualters, J. R., & Lee, L. M. (2012). Public health surveillance in the United States: Evolution and challenges. MMWR: Surveillance Summaries, 61, 3–9.
- Tourassi, G., Yoon, H. J., Xu, S., & Han, X. (2016). The utility of web mining for epidemiological research: Studying the association between parity and cancer risk. Journal of the American Medical Informatics Association, 23(3), 588–595.
- Yang, S., Santillana, M., & Kou, S. C. (2015). Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences USA, 112(47), 14473–14478.
- Yoon, S., & Bakken, S. (2012). Methods of knowledge discovery in tweets. NI 2012 (2012), 2012, 463.
- Zikmund-Fisher, B. J., Fagerlin, A., & Ubel, P. A. (2008). Improving understanding of adjuvant therapy options by using simpler risk graphics. Cancer, 113(12), 3382–3390.