(Cross-posted from the Official Google Blog)
Like many Googlers, we're fascinated by trends in online search queries. Whether you're interested in U.S. elections, today's hot trends, or each year's Zeitgeist, patterns in Google search queries can be very informative. Last year, a small team of software engineers began to explore whether we could go beyond simple trends and accurately model real-world phenomena using patterns in search queries. After meeting with the public health gurus on Google.org's Predict and Prevent team, we decided to focus on outbreaks of infectious disease, which are responsible for millions of deaths around the world each year. You've probably heard of one such disease: influenza, commonly known as "the flu," which is responsible for up to 500,000 deaths worldwide each year. If you or your kids have ever caught the flu, you know just how awful it can be.
Our team found that certain aggregated search queries tend to be very common during flu season each year. We compared these aggregated queries against data provided by the U.S. Centers for Disease Control and Prevention (CDC), and we found that there's a very close relationship between the frequency of these search queries and the number of people who are experiencing flu-like symptoms each week. As a result, if we tally each day's flu-related search queries, we can estimate how many people have a flu-like illness. Based on this discovery, we have launched Google Flu Trends, where you can find up-to-date influenza-related activity estimates for each of the 50 states in the U.S.
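To make the idea concrete, here is a minimal sketch of how a relationship like this could be measured and then used for estimation. It assumes a plain least-squares line between a weekly flu-query share and the percentage of doctor visits for flu-like illness; the post does not say which model Flu Trends actually uses, so treat the linear form, the function names, and every number below as illustrative assumptions rather than real Google or CDC data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def estimate_ili(query_share, slope, intercept):
    """Estimate the flu-like illness percentage from a query share."""
    return slope * query_share + intercept


# Made-up weekly values, for illustration only.
weekly_query_share = [0.10, 0.15, 0.22, 0.30, 0.41, 0.35, 0.20]  # % of searches that are flu-related
weekly_ili_percent = [1.0, 1.7, 2.3, 3.0, 4.3, 3.5, 2.2]  # % of doctor visits for flu-like illness

r = pearson(weekly_query_share, weekly_ili_percent)
slope, intercept = fit_line(weekly_query_share, weekly_ili_percent)
print(f"correlation: {r:.3f}")
print(f"estimated ILI % at a 0.25% query share: {estimate_ili(0.25, slope, intercept):.2f}")
```

The fit is made once against historical surveillance data; after that, a fresh query count is enough to produce an estimate without waiting for new survey results.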
The CDC does a great job of surveying real doctors and patients to accurately track the flu, so why bother with estimates from aggregated search queries? It turns out that traditional flu surveillance systems take 1-2 weeks to collect and release surveillance data, but Google search queries can be automatically counted very quickly. By making our flu estimates available each day, Google Flu Trends may provide an early-warning system for outbreaks of influenza.
For epidemiologists, this is an exciting development, because early detection of a disease outbreak can reduce the number of people affected. If a new strain of influenza virus emerges under certain conditions, a pandemic could follow and cause millions of deaths (as happened, for example, in 1918). Our up-to-date influenza estimates may enable public health officials and health professionals to better respond to seasonal epidemics and, though we hope never to find out, pandemics.
We shared our preliminary results with the Epidemiology and Prevention Branch of the Influenza Division at CDC throughout the 2007-2008 flu season, and together we saw that our search-based flu estimates had a consistently strong correlation with real CDC surveillance data. Our system is still very experimental, so anything is possible, but we're hoping to see similar correlations in the coming year.
We couldn't have created such good models without aggregating hundreds of billions of individual searches going back to 2003. Of course, we're keenly aware of the trust that users place in us and of our responsibility to protect their privacy. Flu Trends can never be used to identify individual users because we rely on anonymized, aggregated counts of how often certain search queries occur each week. The patterns we observe in the data are only meaningful across large populations of Google search users.
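As a rough illustration of what "aggregated counts per week" can look like, the sketch below tallies how often each tracked query occurs in each ISO week. It is not Google's pipeline: the function name, the input shape (date and query text only, with no user identifier at all), and the `min_count` threshold for dropping sparse buckets are all assumptions made for this example.

```python
from collections import Counter
from datetime import date


def weekly_counts(log, tracked_queries, min_count=1):
    """Count tracked queries per (ISO year, ISO week, query).

    `log` is an iterable of (date, query_text) pairs. No user identifier
    is read or stored; only the per-week totals are kept.
    Buckets with fewer than `min_count` occurrences are dropped
    (a hypothetical safeguard, not something the post describes).
    """
    counts = Counter()
    for day, query in log:
        normalized = query.strip().lower()
        if normalized in tracked_queries:
            iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
            counts[(iso[0], iso[1], normalized)] += 1
    return {key: n for key, n in counts.items() if n >= min_count}


# Made-up log entries, for illustration only.
example_log = [
    (date(2008, 11, 10), "Flu Symptoms"),
    (date(2008, 11, 12), "flu symptoms "),
    (date(2008, 11, 17), "flu symptoms"),
    (date(2008, 11, 11), "weather"),
]

for key, n in sorted(weekly_counts(example_log, {"flu symptoms"}).items()):
    print(key, n)
```

Because the output holds only totals per week and per query, it says something about a population of searchers and nothing about any one of them.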
Flu season is here, so avoid becoming part of our statistics and get a flu shot! And keep an eye on those graphs if you're curious to see how the flu season unfolds...
Update on 11/21: The team just published an academic paper in Nature, the international journal of science, explaining the science and methodology behind Flu Trends. Check it out for more information.