'How to get TONE from GDELT GKG table from Google BigQuery?

SELECT
DATE,
EXTRACT(YEAR FROM DATE) AS year,
FIPS as Country,
LOCATIONS,
AVG(TONE) as Avg_Tone,
AVG(Positive Score) as PositiveS,
AVG(Negative Score) as NegativeS,
COUNT(*),
From `gdelt-bq.gdeltv2.gkg_partitioned`,
`gdelt-bq.extra.sourcesbycountry` country,

Where
DATE(_PARTITIONTIME) BETWEEN TIMESTAMP('2002-01-01') AND TIMESTAMP('2020-12-31')
AND SourceCommonName=country.Domain
AND Location like '%CH%'
GROUP BY Year,Country
ORDER BY  Year,Country

Codebook link is http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf The V1.5TONE has TONE, Positive Score and Negative Score and so on. I want to count the average Tone by year. How to get it from Big Query.



Solution 1:[1]

There is the need to cast and split the fields first.

  • The date is a value formated as "yyyymmdd....". Therefore, I suggest casting the value as a string and consider the first four characters as the year.

  • There is no V1.5TONE column, but V2Tone. It consists of a string with several digital numbers seperated by a comma. The string has to be split first. Then each component needs to be casted a decimal number.

  • The table gdelt-bq.extra.sourcesbycountry should map an url to a single country. It has duplicated countries to one url! To eliminate at least some duplicated value the inner select with the group by is used.

  • To obtain the values grouped by Year,Country all other dimensional columns need to be commented out.

SELECT
#DATE,
substr(cast(date as string),0,4) AS year,
FIPS as Country,
#LOCATIONS,
AVG(cast(split(V2Tone,",")[safe_offset(0)] as decimal )) as Avg_Tone,
AVG(cast(split(V2Tone,",")[safe_offset(1)] as decimal )) as PositiveS,
AVG(cast(split(V2Tone,",")[safe_offset(2)] as decimal )) as NegativeS,
COUNT(*) as counts,
From `gdelt-bq.gdeltv2.gkg_partitioned`
left join
(SELECT Domain, FIPS 
from `gdelt-bq.extra.sourcesbycountry`
group by 1,2) country
on  SourceCommonName=country.Domain

Where
DATE(_PARTITIONTIME) BETWEEN DATE('2020-01-01') AND DATE('2020-01-31')
AND Locations like '%CH%'
GROUP BY Year,Country
ORDER BY  Year,Country

Also in where segment: Do not mix DATE and TIMESTAMP.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Samuel