'HIVE CBO. Wrong results with Hive SQL query with MULTIPLE IN conditions in where clause

I am running one SQL query in Hive and it gives different results with CBO enabled and disabled. The results are wrong when CBO is enabled (set hive.cbo.enable=true;).

Prerequisites: Apache Hadoop 2.10.1 + Apache Hive 2.3.6 installed. (I tried to reproduce the issue with Apache Hive 3+ version and Hadoop 3+ version and they work fine.)

Actions to reproduce:

1) Create the table in Hive

CREATE TABLE sample( fname string, sname string, sid string) ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES ( 'transient_lastDdlTime'='1616020251');

2) Insert some records in the table the below values

insert into sample values ("PQ", "F", "33");

insert into sample values ("RK", "A", "11");

insert into sample values ("AZ", "B", "22");

3) Check the table

select * from sample;

PQ F 33
RK A 11
AZ B 22

4) Run the below query. It should give 0 records, instead it gives all records

select fname from sample where sid in ('11') and sid in ('22');

PQ
RK
AZ

5) The results are good when we disable cbo in Hive

set hive.cbo.enable=false;

select fname from sample where sid in ('11') and sid in ('22');

OK Time taken: 0.131 seconds



Solution 1:[1]

The wrong results come from a wrong simplification in the HivePointLookupOptimizerRule [1]. This corresponds to the bug logged under HIVE-21685 [2] that was fixed in versions >= 3.2.0.

As a workaround you can disable the respective rule by setting the respective property to false.

set hive.optimize.point.lookup=false

[1] https://github.com/apache/hive/blob/62834fbdd3bd4065413b59448759c6a25aa1dbf0/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HivePointLookupOptimizerRule.java

[2] https://issues.apache.org/jira/browse/HIVE-21685

Solution 2:[2]

Datatype is String use INT

And the simplest solution is do not use CBO is problem still remains but if you still want to use check these things after altering the datatype

1.Privilages 2.Version control 3.auto tuned

Solution 3:[3]

Use in query to select multiple item

select fname from sample where sid in ('11','22');

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 zabetak
Solution 2 Zoe stands with Ukraine
Solution 3 Rupesh Singh