sqoop merge-key creating multiple part files instead of one, which doesn't serve the purpose of using merge-key

Ideally, when we run an incremental import without a merge-key it creates a new file containing the appended records, whereas with a merge-key it should produce the whole dataset, previous records included, in a single file. But I am not getting one part file when I use incremental append in my sqoop job. Below are my steps:

1) Initial data:

mysql> select * from departments_per;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan Shop        |
+---------------+-----------------+  

2) Sqoop command to import the data into HDFS initially:

sqoop import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_per \
--target-dir /departments \
-m 1

Now when I look at the /departments directory in HDFS I can see one part file, which is fine.

3) Now I update my initial data in mysql:

mysql> select * from departments_demo;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan             |
|             8 | Tushar MC       |
+---------------+-----------------+

4) Now I create an incremental append job and execute it:

sqoop job --create appendJobs12 \
-- import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_demo \
-m 1 \
--target-dir /departments \
--incremental append \
--merge-key department_id \
--check-column department_id \
--last-value 0 

sqoop job --exec appendJobs12   
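As a side check (not one of the numbered steps), the parameters stored in the saved job can be inspected to confirm the incremental options were recorded; sqoop job supports a --show action for this:

sqoop job --show appendJobs12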

5) I can see two part files in the HDFS directory, even though I used the merge-key option:

[cloudera@quickstart ~]$ hadoop fs -ls /departments
Found 3 items
-rw-r--r--   1 cloudera supergroup          0 2018-10-04 00:31 /departments/_SUCCESS
-rw-r--r--   1 cloudera supergroup         60 2018-10-04 00:31 /departments/part-m-00000
-rw-r--r--   1 cloudera cloudera           67 2018-10-04 00:37 /departments/part-m-00001  

When I display data it looks like below:

[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00000
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop 
[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00001
    2,Fitness
    3,Footwear
    4,Apparel
    5,Golf
    6,Outdoors
    7,Fan
    8, Tushar MC

So one part file holds the initial data and the second one holds the updated data. Can anybody tell me where I am going wrong, and why I am not able to get a single part file containing the merged dataset? Thanks in advance.



Solution 1:[1]

I researched this and got the same behaviour with your approach, so it is not correct. Reading the various material on the subject, I have to say it is not very clearly documented. Anyway.

I get the impression that external tables and some Linux scripting are required, along with intermediate target directories.

So: 1) I added the data in mysql, 2) did a sqoop import, 3) updated the data in mysql, 4) ran another incremental import as you did (which I don't think is really correct, but you want updates, so OK), 5) ran codegen, and lastly 6) ran sqoop merge.

These are the main steps:

Initial import

sqoop import -m 1 --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table ged  --warehouse-dir=/user/hive/warehouse --create-hive-table --hive-import --fields-terminated-by ',' --hive-drop-import-delims
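After this initial import the data for the ged table should sit under the warehouse directory (the path follows from --warehouse-dir plus the table name, and it is the same path the merge step below refers to); you can confirm with:

 hadoop fs -ls /user/hive/warehouse/ged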

Incremental load as you did

sqoop import -m 1 --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table ged  --warehouse-dir=/user/hive/warehouse --fields-terminated-by ',' --hive-drop-import-delims --last-value 0 --merge-key id --incremental append --check-column id

Codegen

 sqoop codegen --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table ged
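The codegen step writes the generated record class and a jar into a temporary compile directory; the merge step below needs that jar via --jar-file and --class-name. A quick way to locate it on this setup (the path pattern is taken from the merge command below; the hash directory will differ on your machine):

 ls /tmp/sqoop-cloudera/compile/*/ged.jar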

Merge

 sqoop merge --new-data /user/hive/warehouse/ged --onto /user/hive/warehouse/new_ged --merge-key id --target-dir /user/hive/merged/ged2 --jar-file /tmp/sqoop-cloudera/compile/c8d374075351d228c50d89354959762e/ged.jar --class-name ged

Results:

[cloudera@quickstart ~]$ hadoop fs -cat /user/hive/merged/ged2/*
1,YYY
2,Peter
3,Bobby
4,Maria
5,Joke
6,Joker

whereas I originally had 1,XXX and did not yet have 6,Joker.

This is somewhat different from your setup, so I am not sure what else to state. In any event, a single file is not a valid assumption with lots of data. Your approach may work with extra parameters, but this works as well.

The clue here is that updating an immutable store via merge requires different target directories, which you can switch between by changing the location of an external table.
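To make that concrete, here is a minimal sketch in Hive, assuming a hypothetical external table ged_ext over the imported data; the table definition, columns and paths are illustrative, not taken from the commands above. After each sqoop merge writes a fresh merged directory, you repoint the table at it instead of rewriting files in place:

CREATE EXTERNAL TABLE IF NOT EXISTS ged_ext (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/warehouse/ged';

-- after sqoop merge has produced /user/hive/merged/ged2
ALTER TABLE ged_ext SET LOCATION 'hdfs:///user/hive/merged/ged2';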

Solution 2:[2]

There is always a good time to answer, and it is never too late! Just refreshing my memory; I hope this helps.

I can see a few issues with the above incremental import sqoop command:

  1. The merge-key concept was introduced only for the "lastmodified" incremental load type, so that updated records can be taken care of. But the requirement above uses append mode, and in append mode a new mapper partition file is always created: /departments/part-m-00000, /departments/part-m-00001.
  2. To get one part file, use the "lastmodified" incremental load only.
  3. merge-key is not recommended to be run through sqoop jobs; technically the command will work, but the main objective of merge-key will not be fulfilled.
  4. This is the only place where a sqoop command invokes a reducer process, producing /departments/part-r-00000 (r = reducer), and it only happens when merge-key is run with a lastmodified incremental load.

sqoop import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_demo \
-m 1 \
--target-dir /departments \
--incremental lastmodified \
--merge-key department_id \
--check-column <date or timestamp column> \
--last-value 0

(Changes made: --incremental lastmodified together with --merge-key department_id. Note that in lastmodified mode --check-column allows ONLY date and timestamp datatypes.)
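For completeness: departments_demo as shown has only department_id and department_name, so there is no date or timestamp column for --check-column to track in lastmodified mode. A minimal sketch of what that could look like on the MySQL side, assuming a hypothetical last_updated column is added (the column name is purely illustrative):

mysql> ALTER TABLE departments_demo ADD COLUMN last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

The import would then use --check-column last_updated, and as noted in point 4 above, the merge-key run invokes a reduce phase whose output is a merged /departments/part-r-00000 file.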

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 ShekharAgarwal