Table of Contents
Overview
This is a brief step-by-step description of the Google Cloud Skills Boost’s [Insights from Data with BigQuery] Quest challenge lab.
In this Challenge Lab, we will use an open dataset on COVID19 and run SQL on the tables in the COVID19 open dataset in BigQuery to perform a statistical analysis on COVID19.
The highlighted with markers in the query are variables that change from lab to lab. Please replace them when you are in the lab.
Task 1. Total confirmed cases
Executes a query that returns the total number of infected persons worldwide for a given date (date).
SELECT sum(cumulative_confirmed) as total_cases_worldwide FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date='date'
Task 2. Worst affected areas
Executes a query to determine how many states in the United States have a death count (DeathCount) or greater on a given date (date).
with deaths_by_states as ( SELECT subregion1_name as state, sum(cumulative_deceased) as death_count FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="United States of America" and date='date' and subregion1_name is NOT NULL group by subregion1_name ) select count(*) as count_of_states from deaths_by_states where death_count > DeathCount
Task 3. Identifying hotspots
Runs a query that lists states with more than the number of infected cases (ConfirmedCases) in the United States on a given date (date).
SELECT * FROM ( SELECT subregion1_name as state, sum(cumulative_confirmed) as total_confirmed_cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_code="US" AND date='date' AND subregion1_name is NOT NULL GROUP BY subregion1_name ORDER BY total_confirmed_cases DESC ) WHERE total_confirmed_cases > ConfirmedCases
Task 4. Fatality ratio
Run a query to find the fatality rate for a given month (Month) in 2020 in Italy.
SELECT sum(cumulative_confirmed) as total_confirmed_cases, sum(cumulative_deceased) as total_deaths, (sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="Italy" AND date BETWEEN 'Month_start_day'and 'Month_end_day'
Task 5. Identifying specific day
Run a query to find the days when the number of deaths in Italy exceeded a specified number (DeathCount).
SELECT date FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="Italy" and cumulative_deceased > DeathCount order by date asc limit 1
Task 6. Finding days with zero net new cases
We are trying to figure out the following query to find the days with zero number of infections in India during the specified dates (StartDate/CloseDate), but we are unable to execute it correctly.
WITH india_cases_by_date AS ( SELECT date, SUM(cumulative_confirmed) AS cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="India" AND date between 'StartDate' and 'CloseDate' GROUP BY date ORDER BY date ASC ) , india_previous_day_comparison AS (SELECT date, cases, LAG(cases) OVER(ORDER BY date) AS previous_day, cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases FROM india_cases_by_date )
Runs as a correct query.
WITH india_cases_by_date AS ( SELECT date, SUM(cumulative_confirmed) AS cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="India" AND date between 'StartDate' and 'CloseDate' GROUP BY date ORDER BY date ASC ) , india_previous_day_comparison AS (SELECT date, cases, LAG(cases) OVER(ORDER BY date) AS previous_day, cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases FROM india_cases_by_date ) select count(*) from india_previous_day_comparison where net_new_cases=0
Task 7. Doubling rate
Use the Task 6 query as a template. Run a query to find the days between 3/22/2020 and 4/20/2020 in the U.S. when the number of infections increased by (LimitValue)% or more than the previous day.
WITH us_cases_by_date AS ( SELECT date, SUM(cumulative_confirmed) AS cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="United States of America" AND date between '2020-03-22' and '2020-04-20' GROUP BY date ORDER BY date ASC ) , us_previous_day_comparison AS (SELECT date, cases, LAG(cases) OVER(ORDER BY date) AS previous_day, cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases, (cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase FROM us_cases_by_date ) select Date, cases as Confirmed_Cases_On_Day, previous_day as Confirmed_Cases_Previous_Day, percentage_increase as Percentage_Increase_In_Cases from us_previous_day_comparison where percentage_increase > LimitValue
Task 8. Recovery rate
Run a query that lists (LimitValue) countries from the country with the highest recovery rate through 5/10/2020. Countries with more than 50K infected persons are eligible.
WITH cases_by_country AS ( SELECT country_name AS country, sum(cumulative_confirmed) AS cases, sum(cumulative_recovered) AS recovered_cases FROM bigquery-public-data.covid19_open_data.covid19_open_data WHERE date = '2020-05-10' GROUP BY country_name ) , recovered_rate AS (SELECT country, cases, recovered_cases, (recovered_cases * 100)/cases AS recovery_rate FROM cases_by_country ) SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate FROM recovered_rate WHERE cases > 50000 ORDER BY recovery_rate desc LIMIT LimitValue
Task 9. CDGR – Cumulative daily growth rate
Run a query to find the CDGR (Cumulative Daily Growth Rate) for a given date (Date) in France. The following queries cannot be run correctly.
WITH france_cases AS ( SELECT date, SUM(cumulative_confirmed) AS total_cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="France" AND date IN ('2020-01-24', '2020-05-10') GROUP BY date ORDER BY date) , summary as ( SELECT total_cases AS first_day_cases, LEAD(total_cases) AS last_day_cases, DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff FROM france_cases LIMIT 1 ) select first_day_cases, last_day_cases, days_diff, SQRT((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr from summary
Modify and run the query.
WITH france_cases AS ( SELECT date, SUM(cumulative_confirmed) AS total_cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="France" AND date IN ('2020-01-24', 'Date') GROUP BY date ORDER BY date) , summary as ( SELECT total_cases AS first_day_cases, LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases, DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff FROM france_cases LIMIT 1 ) select first_day_cases, last_day_cases, days_diff, POWER((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr from summary
Task 10. Create a Looker Studio report
Create a chart in Looker Studio that plots the following data within the United States.
- Number of Confirmed Cases
- Number of Deaths
- Date range : DataRange
SELECT date, SUM(cumulative_confirmed) AS country_cases, SUM(cumulative_deceased) AS country_deaths FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date BETWEEN 'DataRangeStart' AND 'DataRangeClose' AND country_name ="United States of America" GROUP BY date
Export query results to Looker Studio for charting.