概要

Google Cloud Skills Boostの[Insights from Data with BigQuery]クエストのチャレンジラボの簡単な手順の解説です。

このチャレンジラボは、COVID19に関するオープンデータセットを利用します。BigQueryでCOVID19オープンデータセットのテーブルにSQLを実行して、COVID19についての統計分析を行います。

クエリー内のマーカーでハイライトしている部分は、ラボによって変わる変数です。ラボのときには、置き換えてください。

Task 1. Total confirmed cases

指定した日付(date)の全世界のトータル感染者数を返すクエリーを実行します。

SELECT sum(cumulative_confirmed) as total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date='date'

Task 2. Worst affected areas

指定した日付(date)のアメリカ国内で死者数(DeathCount)以上の州がいくつあるかを調べるクエリーを実行します。

with deaths_by_states as (
SELECT subregion1_name as state, sum(cumulative_deceased) as death_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="United States of America" and date='date' and subregion1_name is NOT NULL
 group by subregion1_name
)
select count(*) as count_of_states
from deaths_by_states
where death_count > DeathCount

Task 3. Identifying hotspots

指定した日付(date)のアメリカ国内で感染者数(ConfirmedCases)以上の州をリストアップするクエリーを実行します。

SELECT * FROM (
    SELECT subregion1_name as state, sum(cumulative_confirmed) as total_confirmed_cases
    FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
    WHERE country_code="US" AND date='date' AND subregion1_name is NOT NULL
    GROUP BY subregion1_name
    ORDER BY total_confirmed_cases DESC
)
WHERE total_confirmed_cases > ConfirmedCases

Task 4. Fatality ratio

イタリアの2020年の指定月(Month)における致死率を求めるクエリーを実行します。

SELECT sum(cumulative_confirmed) as total_confirmed_cases,
sum(cumulative_deceased) as total_deaths,
(sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" AND date BETWEEN 'Month_start_day'and 'Month_end_day'

Task 5. Identifying specific day

イタリアの死者数が指定数(DeathCount)を超えた日を求めるクエリーを実行します。

SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" and cumulative_deceased > DeathCount
order by date asc
limit 1

Task 6. Finding days with zero net new cases

指定した日付の間(StartDate/CloseDate)において、インドで感染者数がゼロの日を求めるために以下のクエリーを考えていますが、正しく実行できません。

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="India"
    AND date between 'StartDate' and 'CloseDate'
  GROUP BY
    date
  ORDER BY
    date ASC
 )
, india_previous_day_comparison AS
(SELECT
  date,
  cases,
  LAG(cases) OVER(ORDER BY date) AS previous_day,
  cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)

正しいクエリーとして実行します。

WITH india_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="India"
    AND date between 'StartDate' and 'CloseDate'
  GROUP BY
    date
  ORDER BY
    date ASC
 )
, india_previous_day_comparison AS
(SELECT
  date,
  cases,
  LAG(cases) OVER(ORDER BY date) AS previous_day,
  cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
select count(*)
from india_previous_day_comparison
where net_new_cases=0

Task 7. Doubling rate

Task 6のクエリーをテンプレートとして利用します。アメリカ国内で2020/3/22から2020/4/20までの間、前日よりも(LimitValue)%以上感染者数が増加した日を求めるためのクエリーを実行します。

WITH us_cases_by_date AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="United States of America"
    AND date between '2020-03-22' and '2020-04-20'
  GROUP BY
    date
  ORDER BY
    date ASC
 )
, us_previous_day_comparison AS
(SELECT
  date,
  cases,
  LAG(cases) OVER(ORDER BY date) AS previous_day,
  cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases,
  (cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase
FROM us_cases_by_date
)
select Date, cases as Confirmed_Cases_On_Day, previous_day as Confirmed_Cases_Previous_Day, percentage_increase as Percentage_Increase_In_Cases
from us_previous_day_comparison
where percentage_increase > LimitValue

Task 8. Recovery rate

2020/5/10までで回復率が上位の国から(LimitValue)カ国リストするクエリーを実行します。感染者数が50K以上の国が対象です。

WITH cases_by_country AS (
  SELECT
    country_name AS country,
    sum(cumulative_confirmed) AS cases,
    sum(cumulative_recovered) AS recovered_cases
  FROM
    bigquery-public-data.covid19_open_data.covid19_open_data
  WHERE
    date = '2020-05-10'
  GROUP BY
    country_name
 )
, recovered_rate AS
(SELECT
  country, cases, recovered_cases,
  (recovered_cases * 100)/cases AS recovery_rate
FROM cases_by_country
)
SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate desc
LIMIT LimitValue

Task 9. CDGR – Cumulative daily growth rate

フランスにおける指定日付(Date)のCDGR(Cumulative Daily Growth Rate)を求めるクエリーを実行します。以下のクエリーは正しく実行できません。

WITH
  france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="France"
    AND date IN ('2020-01-24',
      '2020-05-10')
  GROUP BY
    date
  ORDER BY
    date)
, summary as (
SELECT
  total_cases AS first_day_cases,
  LEAD(total_cases) AS last_day_cases,
  DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
  france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, SQRT((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary

クエリーを修正して実行します。

WITH
  france_cases AS (
  SELECT
    date,
    SUM(cumulative_confirmed) AS total_cases
  FROM
    `bigquery-public-data.covid19_open_data.covid19_open_data`
  WHERE
    country_name="France"
    AND date IN ('2020-01-24',
      'Date')
  GROUP BY
    date
  ORDER BY
    date)
, summary as (
SELECT
  total_cases AS first_day_cases,
  LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases,
  DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
  france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, POWER((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary

Task 10. Create a Looker Studio report

アメリカ国内で以下のデータをプロットするチャートをLooker Studioで作成します。

  • Number of Confirmed Cases
  • Number of Deaths
  • Date range : DataRange

SELECT
  date, SUM(cumulative_confirmed) AS country_cases,
  SUM(cumulative_deceased) AS country_deaths
FROM
  `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
  date BETWEEN 'DataRangeStart'
  AND 'DataRangeClose'
  AND country_name ="United States of America"
GROUP BY date

クエリー結果をLooker Studioにエクスポートして、チャートを作成します。