Newbie Adventures in COVID Data — Maps for Hotspots and Positivity Rates
While there are a number of sites with maps and charts for COVID-19 outbreaks, it is always an adventure to dig into the data yourself and discover the clean-up and smoothing likely involved. When I started looking at COVID data for my home state of Massachusetts back in June, I found consistent reporting along with demographics. As I branched out to other states, reporting was less consistent, and on some days the only tests reported were positive. This led to the discovery of the Covid Act Now API, which provides a standard output across all the states.
The goal for this effort was to create my own map for positivity rates leveraging moving averages to address the periodicity of data reporting.
Note: I am not a professional data scientist or epidemiologist, just a hobbyist, so please read this article as an adventure in learning data science rather than an expert presentation.
To start, load the libraries:
import requests
import numpy as np
import pandas as pd
from pandas import DataFrame
import plotly.graph_objects as go
import plotly.express as px
from pandas import json_normalize  # the pandas.io.json path is deprecated
from datetime import datetime, date, time, timedelta
Then call the Covid Act Now API (using the time-series endpoint for trends):
Call_URL = "https://data.covidactnow.org/latest/us/states.NO_INTERVENTION.timeseries.json"
response = requests.get(Call_URL)
data = response.json()
The next steps were to clean up the JSON output from the API and pull it into a data frame:
COVID_data = json_normalize(data, record_path='timeseries',
                            meta=['stateName', 'lastUpdatedDate', 'fips'])
COVID_df = DataFrame(COVID_data, columns=['date', 'fips', 'population',
                                          'cumulativePositiveTests',
                                          'cumulativeNegativeTests', 'stateName'])
For easier comparisons, I converted the dates to YYYY-MM-DD datetime format.
COVID_df['date'] = pd.to_datetime(COVID_df['date'], format='%Y-%m-%d')
Now on to the data clean-up.
As data can be reported mid-day or later, I pulled data only through the prior day.
today = datetime.now()
yesterday = date.today() - timedelta(days=1)
Then I subsetted the data to start on March 15th, when reporting seemed more consistent. For the maps alone I could have selected just recent months, but I grabbed a larger subset for future trending work.
start_date = "2020-03-15"
start_date_time = datetime.strptime(start_date, '%Y-%m-%d')
Chart_Data = COVID_df.loc[COVID_df['date'] >= start_date_time]
Chart_Data = Chart_Data.loc[Chart_Data['date'] <= pd.Timestamp(yesterday)]
Then I created functions for the calculations to:
- Calculate daily testing as the delta between daily cumulative counts
- Calculate the daily positivity rate
- Compute an exponential moving average (7-day span) for the daily positivity rate
- Calculate a rolling moving average (7-day window) for positive cases per 100,000 as a proxy for hotspots
Because some states report sporadically and others every day, data cleanup was helpful between the functions to strip out days where there was no testing or days where all tests were reported as positive.
As the data was reported as cumulative, the first step was to break out daily figures with a function called CovidDiff. The data arrives as one long series covering every state, so each function needed to be applied per state group.
Chart_Data['Dly_PosTests'] = Chart_Data['cumulativePositiveTests'].diff()
Chart_Data['Dly_NegTests'] = Chart_Data['cumulativeNegativeTests'].diff()
Chart_Data['Dly_Tests'] = Chart_Data['Dly_PosTests'] + Chart_Data['Dly_NegTests']
I ran this starting at the second row so each row could look back at the prior one:
covid_range = (len(Chart_Data) - 1)
for i in range(1, covid_range):
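The per-state application described above can be sketched with pandas `groupby`; the frame and numbers here are toy values invented for illustration, not real API output:

```python
import pandas as pd

# Toy frame with two states (hypothetical values, not real API data)
toy = pd.DataFrame({
    'fips': ['25', '25', '25', '50', '50', '50'],
    'cumulativePositiveTests': [10, 15, 22, 3, 4, 9],
})

# A plain .diff() over the whole column would subtract the last row of
# one state from the first row of the next; grouping by state instead
# leaves each state's first day as NaN.
toy['Dly_PosTests'] = toy.groupby('fips')['cumulativePositiveTests'].diff()
```

The grouped version also makes the loop unnecessary, which may help with the performance concerns noted at the end of this article.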
Then, to clear out the all-positive days:
Chart_Data = Chart_Data[Chart_Data["Dly_NegTests"] > 0]
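A tiny example (made-up numbers) shows why that filter matters: a day with zero negative tests would otherwise read as 100% positivity, reflecting reporting practice rather than the epidemic:

```python
import pandas as pd

# Hypothetical three days of reporting; the middle day logged only positives
tests = pd.DataFrame({
    'Dly_PosTests': [5.0, 8.0, 3.0],
    'Dly_NegTests': [95.0, 0.0, 117.0],
})

# Dropping zero-negative days removes the artificial 100% positivity spike
clean = tests[tests['Dly_NegTests'] > 0]
```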
The next function, CovidPos, calculated daily positivity rates and positive tests per 100,000 population.
Chart_Data['Dly_Pos_Rate'] = (100 * (Chart_Data['Dly_PosTests'] / Chart_Data['Dly_Tests']))
Chart_Data['Pos_per_100K'] = ((Chart_Data['Dly_PosTests'] * 100000) / Chart_Data['population'])
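Plugging made-up numbers through those two formulas gives a quick sanity check:

```python
import pandas as pd

# Hypothetical single day for a state of one million people
day = pd.DataFrame({
    'Dly_PosTests': [50.0],
    'Dly_NegTests': [950.0],
    'population': [1_000_000],
})
day['Dly_Tests'] = day['Dly_PosTests'] + day['Dly_NegTests']

# 50 positives out of 1,000 tests -> 5% positivity
day['Dly_Pos_Rate'] = 100 * (day['Dly_PosTests'] / day['Dly_Tests'])

# 50 positives in a population of 1,000,000 -> 5 per 100K
day['Pos_per_100K'] = (day['Dly_PosTests'] * 100000) / day['population']
```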
Finally a function for the moving averages:
Chart_Data['Dly_Pos_Rate_EXA'] = Chart_Data.Dly_Pos_Rate.ewm(span=7, adjust=False).mean()
Chart_Data['Pos_per_100K_MA'] = Chart_Data.Pos_per_100K.rolling(window=7, min_periods=1).mean()
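The difference between the two averages shows up clearly on a spiky toy series (values invented for illustration): the exponential average reacts to a one-day spike and then decays, while the rolling mean spreads it evenly over the window:

```python
import pandas as pd

# One reporting spike on day 4 of an otherwise flat 10% positivity rate
rates = pd.Series([10.0, 10.0, 10.0, 40.0, 10.0, 10.0, 10.0])

# span=7 gives alpha = 2/(7+1) = 0.25; with adjust=False this is the
# simple recursive EMA: ema_t = 0.25*x_t + 0.75*ema_{t-1}
ema = rates.ewm(span=7, adjust=False).mean()

# min_periods=1 lets the rolling mean emit values before a full week exists
sma = rates.rolling(window=7, min_periods=1).mean()
```

On the spike day the EMA jumps to 17.5 and then eases back toward 10, while the 7-day rolling mean carries the spike at equal weight for a full week.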
Then to narrow down to the most current data:
Plot_data = Chart_Data.sort_values(by="date").drop_duplicates(subset=["fips"], keep="last")
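That sort-then-dedupe pattern keeps the most recent row per state; a toy frame (invented values) makes the behavior concrete:

```python
import pandas as pd

# Hypothetical two-state history; state '25' has two days of data
history = pd.DataFrame({
    'fips': ['25', '25', '50'],
    'date': pd.to_datetime(['2020-08-01', '2020-08-02', '2020-08-02']),
    'Pos_per_100K_MA': [4.0, 5.0, 1.0],
})

# After sorting by date, keep='last' retains each state's newest row
latest = history.sort_values(by='date').drop_duplicates(subset=['fips'], keep='last')
```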
As Plotly seems to prefer two-letter state codes over FIPS for state maps, I merged in a list of state codes:
State_code_list = pd.read_csv('https://raw.githubusercontent.com/AlisonDoucette/Files/master/State-Name-Code.csv')
Map_data = pd.merge(Plot_data, State_code_list, on="stateName")
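The merge is a plain inner join on the state name; a two-row sketch (hypothetical frames standing in for `Plot_data` and the CSV) shows the shape of the result:

```python
import pandas as pd

# Hypothetical stand-ins for Plot_data and the state-code CSV
plot_rows = pd.DataFrame({'stateName': ['Massachusetts', 'Wyoming'],
                          'Pos_per_100K_MA': [3.2, 8.9]})
codes = pd.DataFrame({'stateName': ['Massachusetts', 'Wyoming'],
                      'Code': ['MA', 'WY']})

# pd.merge defaults to an inner join, so any state whose name doesn't
# match the code list is silently dropped -- worth checking row counts
mapped = pd.merge(plot_rows, codes, on='stateName')
```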
Finally the maps:
- One for Hot Spots (more than 10 positive cases per day per 100,000 population is considered really “hot”).
import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locations=Map_data['Code'],                   # spatial coordinates
    z=Map_data['Pos_per_100K_MA'].astype(float),  # data to be color-coded
    zmax=100,
    locationmode='USA-states',                    # entries in `locations` are state codes
    colorscale='temps',
    colorbar_title="Positive Cases per 100K",
))

fig.update_layout(
    title_text='Hot Spot States',
    geo_scope='usa',                              # limit map scope to USA
)
fig.show()
- One for Daily Positivity Rates
With this cleaned-up data I could also look at trends in testing, or add in data on hospitalizations or fatalities. I could continue to sanity-check my code by parsing out one state at a time (Wyoming is a great test case for moving averages), experiment with different windows for the moving averages, work on better performance (particularly the loops), or highlight the likely lack of precision or accuracy given sparse data. Feel free to comment with suggestions!
For me, the new challenges ahead are seeing if I can get Dash to work on my Mac (https://www.youtube.com/channel/UCqBFsuAz41sqWcFjZkqmJqQ) and picking the next research topic in areas that just sound fun.
The notebook for this code can be found here: https://github.com/AlisonDoucette/COVID-19--US-Maps