'Sharing an Oracle table among Spark Nodes using Python

I have an huge Oracle table to process, so I define a list of where clauses to read by each Spark node. In the middle of the processing I need to join the data in each node to the small Oracle table. I have several idea to do that:

  1. Join the huge table to the small one, and then each node reads its own data based on the where clauses.
  2. Broadcast the small Oracle table among nodes.
  3. In each node, read part of the huge table based on "where" clause and then read the small table. After that join the small table to the data that each node has.

I test first way, but it takes so long because the result of joining two table, makes huge result and it takes a lot of time to process even though , it read based on "where" clause. My code is here:

hash_function = lambda x: 'ora_hash({}, {})'.format(x, num_partitions)
dates_df = connection.read_sql('SELECT distinct VerifiedDate D FROM {}'.format(source_table_name)) 
dates = list(dates_df.loc[:, 'D'])
cst_hash_df = connection.read_sql('SELECT distinct {} hash FROM {}'.format(
    hash_function('cst_id'), source_table_name))  
cst_hash_values = list(cst_hash_df.loc[:, 'HASH'])
table_name = '''(select r_trn.*,acc.ACC_CST_TYPE_ID from r_trn 
                 left join acc on r_trn.cst_id=acc.cst_id) dbtable'''
predicates = [
    "to_date(VerifiedDate,'YYYYMMDD','nls_calendar=persian')= to_date({} ,'YYYYMMDD','nls_calendar=persian') and ora_hash(customer_id,{}) = {}"
        .format(d, num_partitions, cst) for d in dates for cst in
dataframe = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url='jdbc:oracle:thin:user/[email protected]:1521:orcl',
          table=table_name,
          predicates=predicates)
 unique_df = dataframe.rdd.mapPartitions(check_data.unique).toDF(schema=y)
 ...

Would you please guide me how to share the small Oracle table among Spark nodes?

Any help is really appreciated.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source