'How can I download all tags in the 'Posts' table?
If I write "select Tags from Posts WHERE Tags IS NOT NULL" on site https://data.stackexchange.com I can only download 50,000 rows. How can I get all rows?
Solution 1:[1]
This doesn't full solve your problem, but at least it phrases the logic in a more useful way.
You are querying the wrong table for this type of problem. Instead:
with pt as (
select pt.tagid, pt.postid, t.tagname
from posttags pt join
tags t
on pt.tagid = t.id
)
select pt.tagname, pt2.tagname, count(*)
from pt join
pt pt2
on pt.postid = pt2.postid and
pt.tagname < pt2.tagname
group by pt.tagname, pt2.tagname
order by count(*) desc;
You are still limited to 50,000 rows, but you are letting the database do more of the work.
The Stack Overflow SEDE is limited to 50,000 rows. That is simply how it works. A few years ago, they moved the data over to BigQuery -- unfortunately modifying the schema a bit (I would be really curious to know why).
In any case, you can return large result sets on BigQuery. The equivalent query is:
#standardSQL
with pt as (
select pq.*, tag
from `bigquery-public-data.stackoverflow.posts_questions` pq cross join
unnest(split(pq.tags, '|')) tag
)
select pt.tag as tag1, pt2.tag as tag2, count(*) as cnt
from pt join
pt pt2
on pt2.id = pt.id and pt.tag < pt2.tag
group by 1, 2
order by count(*) desc;
This returns all 4,689,465 rows of results -- which you can save in a BigQuery table or export to a file and move to another environment.
You can signup for a BigQuery account and run this.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Peter Mortensen |