How do I write this Python code to use 2+ fewer nested if statements?

I have the following code which I use to loop through row groups in a parquet metadata file to find the maximum values for columns i,j,k across the whole file. As far as I know I have to find the max value in each row group.

I am looking for:

  • how to write it with at least two fewer levels of nesting
  • in fewer lines in general

I tried to use a dictionary of lambdas as a switch statement in place of some of the if statements, to eliminate at least two levels of nesting, but I couldn't figure out how to do the greater-than comparison without nesting further.

import pyarrow.parquet as pq


def main():
    metafile = r'D:\my_parquet_meta_file.metadata'
    meta = pq.read_metadata(metafile)

    max_i = 0
    max_j = 0
    max_k = 0

    for grp in range(0, meta.num_row_groups):
        for col in range(0, meta.num_columns):
            # locate columns i,j,k
            if meta.row_group(grp).column(col).path_in_schema in ['i', 'j', 'k']:
                if meta.row_group(grp).column(col).path_in_schema == 'i':
                    if meta.row_group(grp).column(col).statistics.max > max_i:
                        max_i = meta.row_group(grp).column(col).statistics.max
                if meta.row_group(grp).column(col).path_in_schema == 'j':
                    if meta.row_group(grp).column(col).statistics.max > max_j:
                        max_j = meta.row_group(grp).column(col).statistics.max
                if meta.row_group(grp).column(col).path_in_schema == 'k':
                    if meta.row_group(grp).column(col).statistics.max > max_k:
                        max_k = meta.row_group(grp).column(col).statistics.max

    print('max i: ' + str(max_i), 'max j: ' + str(max_j), 'max k: ' + str(max_k))


if __name__ == '__main__':
    main()


Solution 1:[1]

I've had someone give me 2 solutions:

The first uses a list to hold the max value for each of my nominated columns, then uses Python's built-in max function to pick the higher value before assigning it back. I must say I'm not a huge fan of tracking each max value by an unnamed list position, but it does the job in this instance and I can't fault it.

Solution 1:

import pyarrow.parquet as pq

def main():
    metafile = r'D:\my_parquet_meta_file.metadata'
    meta = pq.read_metadata(metafile)
    max_value = [0, 0, 0]
    for grp in range(0, meta.num_row_groups):
        for col in range(0, meta.num_columns):
            column = meta.row_group(grp).column(col)
            for i, name in enumerate(['i', 'j', 'k']):
                if column.path_in_schema == name:
                    max_value[i] = max(max_value[i], column.statistics.max)

    print(dict(zip(['max i', 'max j', 'max k'], max_value)))

if __name__ == '__main__':
    main()
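
On the point about unnamed positional values, one alternative is to key the running maximums by column name in a dict. This is only a minimal sketch: the helper names max_by_name and fake_column are hypothetical, and SimpleNamespace objects stand in for the pyarrow column objects so the example runs without a parquet file. It assumes each column exposes path_in_schema and statistics.max, as in the code above.

```python
from types import SimpleNamespace


def max_by_name(columns, names=('i', 'j', 'k')):
    """Return {name: largest statistics.max seen} for the nominated columns."""
    # Start every maximum at 0, matching the starting value used above
    max_value = dict.fromkeys(names, 0)
    for column in columns:
        name = column.path_in_schema
        if name in max_value:
            max_value[name] = max(max_value[name], column.statistics.max)
    return max_value


def fake_column(name, stat_max):
    # Hypothetical stand-in for a pyarrow column metadata object
    return SimpleNamespace(path_in_schema=name,
                           statistics=SimpleNamespace(max=stat_max))


sample = [fake_column('i', 3), fake_column('j', 7), fake_column('x', 99),
          fake_column('i', 5), fake_column('k', 2)]
result = max_by_name(sample)
print(result)
```

With real metadata, the columns argument would be the column objects gathered from meta.row_group(grp).column(col).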

The second uses similar methods, but additionally uses a list comprehension to collect all of the column objects before iterating through them to find each column's max value. This removes one more level of nesting but, more importantly, separates the gathering of column objects into its own collection before interrogating them, making the process a little clearer. The downside, I think, is that it may require more memory, because every column object is retained rather than just the reported max value.


Solution 2:

import pyarrow.parquet as pq

def main():
    metafile = r'D:\my_parquet_meta_file.metadata'
    meta = pq.read_metadata(metafile)
    max_value = [0, 0, 0]
    columns = [meta.row_group(grp).column(col)
               for col in range(0, meta.num_columns)
               for grp in range(0, meta.num_row_groups)]  # for clauses nest left to right: col is the outer loop, grp the inner
    for column in columns:
        for i, name in enumerate(['i', 'j', 'k']):
            if column.path_in_schema == name:
                max_value[i] = max(max_value[i], column.statistics.max)
    print(dict(zip(['max i', 'max j', 'max k'], max_value)))

if __name__ == '__main__':
    main()

*Update: I first believed this version uses less memory because "columns" was a generator rather than a list. That is not correct for the code as written: square brackets create a list comprehension, which builds the complete list before the second loop starts, in Python 3 just as in Python 2. To get lazy behaviour, swap the square brackets for parentheses, which creates a generator expression. A generator expression doesn't retrieve each column until it's requested in the second loop, where I iterate through "columns". The downside of a generator is that you can only iterate through it once (it's not reusable) unless you redefine it. The upside is that if I wanted to "break" from the loop once I'd found a desired value, I could, and there would be no remaining list taking up memory, nor would every column object have been retrieved, making it faster. In my case it doesn't matter much because I go through every column anyway, but the generator version does so with a lower memory footprint.

*Note: the distinction is in the brackets, not the Python version. Square brackets return a populated list; parentheses return a generator.

# Parentheses make this a generator expression; square brackets would build a populated list
columns = (meta.row_group(grp).column(col)
           for col in range(0, meta.num_columns)
           for grp in range(0, meta.num_row_groups))

To get a populated list from a generator expression, pass it to the list() function, e.g. columns = list(<generator expression ... >)
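
The difference between the two bracket styles can be checked directly. A small sketch using plain integers instead of column objects; the variable names are only for illustration:

```python
# Square brackets: a list comprehension, fully built straight away
as_list = [n * n for n in range(4)]

# Parentheses: a generator expression, evaluated lazily on demand
as_gen = (n * n for n in range(4))

print(type(as_list).__name__)
print(type(as_gen).__name__)

# A generator can only be consumed once; the second pass finds nothing left
first_pass = list(as_gen)
second_pass = list(as_gen)

# Multiple for clauses nest left to right: col is the outer loop, grp the inner
pairs = [(col, grp) for col in range(2) for grp in range(3)]
print(pairs)
```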

Solution 2:[2]

You can simulate a switch statement with the following function:

def switch(v): yield lambda *c: v in c

It works by running a single-pass for loop whose if/elif/else conditions don't need to repeat the value being switched on. For example:

for case in switch(x):
    if case(3):
        pass  # ... do something
    elif case(4, 5, 6):
        pass  # ... do something else
    else:
        pass  # ... do some other thing

It can also be used in a more C style:

for case in switch(x):

    if case(3):
        # ... do something
        break

    if case(4, 5, 6):
        # ... do something else
        break
else:
    pass  # ... do some other thing
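
To see the C-style pattern run end to end, here is a small self-contained sketch. The describe function and its return strings are made up for illustration. The else belongs to the for loop, so it only runs when no case hit a break.

```python
def switch(v):
    # Yields exactly one callable, so the for loop body runs once
    yield lambda *c: v in c


def describe(x):
    for case in switch(x):
        if case(3):
            result = 'three'
            break
        if case(4, 5, 6):
            result = 'four to six'
            break
    else:
        # Reached only when the loop finished without a break
        result = 'something else'
    return result


print(describe(3))
print(describe(5))
print(describe(7))
```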

Here's how to use it with your code:

...
for case in switch(meta.row_group(grp).column(col).path_in_schema):
    if not case('i', 'j', 'k'): break
    statMax = meta.row_group(grp).column(col).statistics.max
    if   case('i') and statMax > max_i: max_i = statMax
    elif case('j') and statMax > max_j: max_j = statMax
    elif case('k') and statMax > max_k: max_k = statMax
...

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2