How to query CSV using pure Spark SQL
I want to get output from the spark-sql CLI, but the data is in a CSV file separated by "\t". Is there any way to do this using pure SQL?
A command like: spark-sql -e 'select * from csv.`xxx` where xxx=xxx'
Is the only way to create a temp view first and then query the view?
Solution 1:[1]
You can create a table for the existing data with all the options, like this:
create table if not exists table_name
using csv
OPTIONS (header "true", inferSchema "true", path ".....csv");
and then query it in subsequent calls. Alternatively, you can put this statement in a .sql file and provide the file name via the -i option so that it is executed before your select query.
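For example, the statement can live in an init file that runs before the query (a sketch; `init.sql`, the table name, and the path are placeholders, not from the original answer):

```sql
-- init.sql: hypothetical init script, run via `spark-sql -i init.sql`
-- creates the table once so later queries can reference it by name
create table if not exists my_csv_table
using csv
OPTIONS (header "true", inferSchema "true", delimiter "\t", path "/path/to/data.csv");
```

It could then be queried with `spark-sql -i init.sql -e 'select * from my_csv_table'`.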
Solution 2:[2]
This command works, using the "\t" delimiter and creating a table:
spark-sql -e 'create table if not exists benchmark_eraseme
using csv
OPTIONS (header "true", delimiter "\t", path "/home/ec2-user/demo/customer_benchmark.csv");
select * from benchmark_eraseme;'
File:
cat ~/demo/customer_benchmark.csv
id_customer description
49 client_1
8 Benchmark Gold
6 Benchmark Silver
7 Benchmark Regular
Output:
Time taken: 8.467 seconds
49 client_1
8 Benchmark Gold
6 Benchmark Silver
7 Benchmark Regular
Time taken: 0.548 seconds, Fetched 4 row(s)
I tried using a temporary view, but at least in Spark 3.2.1 it is not possible to create a temporary view directly from a CSV file.
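For reference, Spark's SQL documentation also describes a `CREATE TEMPORARY VIEW ... USING` data-source syntax; whether it works with CSV options may depend on the Spark version (a hedged, untested sketch; the view name is a placeholder):

```sql
-- hypothetical: temporary view backed by the CSV data source
CREATE TEMPORARY VIEW benchmark_view
USING csv
OPTIONS (header "true", delimiter "\t", path "/home/ec2-user/demo/customer_benchmark.csv");
```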
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | Alex Ott
Solution 2 | Luis Estrada