Overview

This is a brief step-by-step description of the Google Cloud Skills Boost’s [Insights from Data with BigQuery] Quest challenge lab.

In this Challenge Lab, we will use an open dataset on COVID19 and run SQL on the tables in the COVID19 open dataset in BigQuery to perform a statistical analysis on COVID19.

The highlighted with markers in the query are variables that change from lab to lab. Please replace them when you are in the lab.

Task 1. Total confirmed cases

Executes a query that returns the total number of infected persons worldwide for a given date (date).

SELECT sum(cumulative_confirmed) as total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date='date'

Task 2. Worst affected areas

Executes a query to determine how many states in the United States have a death count (DeathCount) or greater on a given date (date).

with deaths_by_states as (
SELECT subregion1_name as state, sum(cumulative_deceased) as death_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="United States of America" and date='date' and subregion1_name is NOT NULL
 group by subregion1_name
)
select count(*) as count_of_states
from deaths_by_states
where death_count > DeathCount

Task 3. Identifying hotspots

Runs a query that lists states with more than the number of infected cases (ConfirmedCases) in the United States on a given date (date).

SELECT * FROM (
    SELECT subregion1_name as state, sum(cumulative_confirmed) as total_confirmed_cases
    FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
    WHERE country_code="US" AND date='date' AND subregion1_name is NOT NULL
    GROUP BY subregion1_name
    ORDER BY total_confirmed_cases DESC
)
WHERE total_confirmed_cases > ConfirmedCases

Task 4. Fatality ratio

Run a query to find the fatality rate for a given month (Month) in 2020 in Italy.

SELECT sum(cumulative_confirmed) as total_confirmed_cases,
sum(cumulative_deceased) as total_deaths,
(sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" AND date BETWEEN 'Month_start_day'and 'Month_end_day'

Task 5. Identifying specific day

Run a query to find the days when the number of deaths in Italy exceeded a specified number (DeathCount).

SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" and cumulative_deceased > DeathCount
order by date asc
limit 1

Task 6. Finding days with zero net new cases

We are trying to figure out the following query to find the days with zero number of infections in India during the specified dates (StartDate/CloseDate), but we are unable to execute it correctly.

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="India"
    AND date between 'StartDate' and 'CloseDate'
  GROUP BY
    date
  ORDER BY
    date ASC
 )
, india_previous_day_comparison AS
(SELECT
  date,
  cases,
  LAG(cases) OVER(ORDER BY date) AS previous_day,
  cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)

Runs as a correct query.

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="India"
    AND date between 'StartDate' and 'CloseDate'
  GROUP BY
    date
  ORDER BY
    date ASC
 )
, india_previous_day_comparison AS
(SELECT
  date,
  cases,
  LAG(cases) OVER(ORDER BY date) AS previous_day,
  cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
select count(*)
from india_previous_day_comparison
where net_new_cases=0

Task 7. Doubling rate

Use the Task 6 query as a template. Run a query to find the days between 3/22/2020 and 4/20/2020 in the U.S. when the number of infections increased by (LimitValue)% or more than the previous day.

WITH us_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="United States of America"
    AND date between '2020-03-22' and '2020-04-20'
  GROUP BY
    date
  ORDER BY
    date ASC
 )
, us_previous_day_comparison AS
(SELECT
  date,
  cases,
  LAG(cases) OVER(ORDER BY date) AS previous_day,
  cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases,
  (cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase
FROM us_cases_by_date
)
select Date, cases as Confirmed_Cases_On_Day, previous_day as Confirmed_Cases_Previous_Day, percentage_increase as Percentage_Increase_In_Cases
from us_previous_day_comparison
where percentage_increase > LimitValue

Task 8. Recovery rate

Run a query that lists (LimitValue) countries from the country with the highest recovery rate through 5/10/2020. Countries with more than 50K infected persons are eligible.

WITH cases_by_country AS (
  SELECT
    country_name AS country,
    sum(cumulative_confirmed) AS cases,
    sum(cumulative_recovered) AS recovered_cases
  FROM
    bigquery-public-data.covid19_open_data.covid19_open_data
  WHERE
    date = '2020-05-10'
  GROUP BY
    country_name
 )
, recovered_rate AS
(SELECT
  country, cases, recovered_cases,
  (recovered_cases * 100)/cases AS recovery_rate
FROM cases_by_country
)
SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate desc
LIMIT LimitValue

Task 9. CDGR – Cumulative daily growth rate

Run a query to find the CDGR (Cumulative Daily Growth Rate) for a given date (Date) in France. The following queries cannot be run correctly.

WITH
  france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="France"
    AND date IN ('2020-01-24',
      '2020-05-10')
  GROUP BY
    date
  ORDER BY
    date)
, summary as (
SELECT
  total_cases AS first_day_cases,
  LEAD(total_cases) AS last_day_cases,
  DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
  france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, SQRT((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary

Modify and run the query.

WITH
  france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="France"
    AND date IN ('2020-01-24',
      'Date')
  GROUP BY
    date
  ORDER BY
    date)
, summary as (
SELECT
  total_cases AS first_day_cases,
  LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases,
  DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
  france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, POWER((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary

Task 10. Create a Looker Studio report

Create a chart in Looker Studio that plots the following data within the United States.

  • Number of Confirmed Cases
  • Number of Deaths
  • Date range : DataRange

SELECT
  date, SUM(cumulative_confirmed) AS country_cases,
  SUM(cumulative_deceased) AS country_deaths
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date BETWEEN 'DataRangeStart'
  AND 'DataRangeClose'
  AND country_name ="United States of America"
GROUP BY date

Export query results to Looker Studio for charting.