Crowdsourcing for robust, real-time COVID-19 data

Timely data on the contagion will be crucial to saving millions of lives, say Weixing Zhang*, Rockli Kim** and S.V. Subramanian***.

doi:10.1038/nindia.2020.109 Published online 14 July 2020

Crowdsourced data play a major role in monitoring and battling COVID-19. The Johns Hopkins University coronavirus dashboard, with data collected from a variety of sources including governments, news and social media, is a case in point1.

In low- and middle-income countries with limited medical and public health resources, cases and deaths due to COVID-19 are rising rapidly. According to the World Health Organization’s COVID-19 dashboard, at the end of the first week of July 2020 India had 719,665 confirmed cases, the third-highest count worldwide. Worryingly, testing for COVID-19 in India is still restricted to at-risk individuals, who are not representative of the general population2. Although the positivity rate in the country is increasing, the COVID-19 death rate has been declining3.

Timely data on infection rates are particularly crucial for decision makers to formulate evidence-based, life-saving policies and prioritize countermeasures4.

Consider this. A popular COVID-19 crowdsourcing data site had received 1.75 billion page views5 as of 20 May 2020. Going by Google Scholar search results, the data have also been widely used in the scientific community for modeling community infection patterns and making predictions. Such crowdsourcing has shown great potential in many fields, and particularly so in public health. Given its timeliness in collating information6, it is vital to understand the importance of crowdsourced data in India relative to official data on COVID-19.

An analysis of COVID-19 data from India’s Ministry of Health and Family Welfare (MoHFW) from 14 March 2020 to 19 June 2020 (the earliest available) and data from the crowdsourcing site across all states and union territories in India (Table 1) shows that the official data are generally a day behind the crowdsourcing site. The time lag grew as more cases were confirmed, and was greater in May and June than in March and April.

Andhra Pradesh has the shortest average time lag, one fourth of a day, followed by Ladakh and Mizoram with an average time lag of half a day. In some states and union territories (for example, Chandigarh, Himachal Pradesh, Odisha, Assam, Arunachal Pradesh, and Dadra and Nagar Haveli and Daman and Diu), the time lag is up to two days.

Table 1. The time lag (in days) between COVID-19 data from the Ministry of Health and Family Welfare, India and the data from the crowdsourcing site.
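
The article does not spell out how the time lag was calculated. The sketch below shows one plausible approach, offered only as an illustration: for every day on which the official cumulative count rises, it finds the earliest day on which the crowdsourced count had already reached that level and averages the differences. The function name `average_lag_days` and the toy numbers are hypothetical and are not taken from the MoHFW or the crowdsourcing site.

```python
from datetime import date


def average_lag_days(official, crowdsourced):
    """Average number of days by which `official` trails `crowdsourced`.

    Both arguments are lists of (date, cumulative_count) pairs sorted by
    date. This matching rule is an assumption, not the authors' method.
    """
    lags = []
    previous = 0
    for official_day, count in official:
        if count <= previous:
            continue  # no new cases reported that day, nothing to time
        previous = count
        # Earliest day the crowdsourced total had already reached this count.
        reached = [day for day, total in crowdsourced if total >= count]
        if reached:
            lags.append((official_day - min(reached)).days)
    return sum(lags) / len(lags) if lags else None


# Made-up cumulative counts for a single hypothetical state.
official = [
    (date(2020, 5, 1), 10),
    (date(2020, 5, 2), 10),
    (date(2020, 5, 3), 25),
    (date(2020, 5, 4), 40),
]
crowdsourced = [
    (date(2020, 4, 30), 10),
    (date(2020, 5, 2), 25),
    (date(2020, 5, 3), 32),
    (date(2020, 5, 4), 41),
]

lag = average_lag_days(official, crowdsourced)
print(f"Average lag: {lag:.2f} days")
```

A positive value means the official series trails the crowdsourced one. Running the same calculation per state or union territory would give a table of the kind described in Table 1, although the published figures may rest on a different matching rule.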

The crowdsourcing site appears to be updated more promptly than the government's site. The MoHFW’s website mentions updating the data at a scheduled time, while the crowdsourcing site updates its data from state press bulletins, the official social media handles of Chief Ministers or Health Ministers, the Press Information Bureau, Press Trust of India, and Asian News International reports that are “generally more recent”.

MoHFW provides only the most recent district-wise data, in a PDF without a time stamp that covers only some districts (e.g., the district-wise report released on 7 July 2020 includes 429 out of 718 districts, or about 60%). Some district-wise data released earlier included fewer than 200 districts.

In contrast, the crowdsourcing site provides data at the district and city levels, with information dating back to 4 January 2020, in both application programming interface (API) and spreadsheet formats. Because of these differences between the official and crowdsourced data, the aggregated case numbers by administrative unit (e.g., state) from the two sources cannot be directly compared. This leaves room for potential inconsistencies. A data-matching analysis at the individual record level is much needed to explore this issue.

Each of these sources has strengths and limitations. On the one hand, the faster response time and finer geographic granularity of the crowdsourced data could have far-reaching impacts on informed policy decisions and scientific recommendations of COVID-19 countermeasures. On the other hand, official data are considered to be more authoritative because they have been verified and reconciled with the Indian Council of Medical Research (ICMR), according to the MoHFW.

(The authors are from the *Harvard Center for Population and Development Studies; **Division of Health Policy & Management, College of Health Science, Korea University; and ***Harvard Center for Population and Development Studies. They thank for making the COVID-19 crowdsourced data accessible through API; and Salil J, Vivek Khanna, and their teammates for making COVID-19 data from India’s MoHFW accessible through API.)



1. Dong, E. et al. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. 20, 533-534 (2020)

2. Subramanian, S. V. & James, K. S. Use of the Demographic and Health Survey framework as a population surveillance strategy for COVID-19. Lancet Glob. Health (2020)

3. Sinha, A. India coronavirus numbers explained: Two trends still strong — increasing positivity rate, declining death rate. The Indian Express, 14 July 2020

4. Morgan, O. How decision makers can use quantitative approaches to guide outbreak responses. Philos. T. R. Soc. B. 374, 20180365 (2019)

5. Polidor, K. Binghamton computer science student builds website to track COVID-19 in India. Binghamton University (2020)

6. Wazny, K. “Crowdsourcing” ten years in: A review. J. Glob. Health. (2017)

7. Xu, B. et al. Open access epidemiological data from the COVID-19 outbreak. Lancet Infect. Dis. 20, 534 (2020)