'What determines the item order when converting a set to a list?
I know there are several similar questions, but I haven't found one yet that talks about what I would like to know. If this is a duplicate, please point it out.
So I know a set
in Python is an unordered collection, while a list
can be sorted. What I am wondering about is what determins how the items of a list are ordered when the list was converted from a set.
Even though a set is 'technically' not sorted (which I guess just means that you cannot interact with it like you would with sequence types), there is still an order in the sense that when you print a set e.g., there has to be an item that is printed first, second, third and so on. This kind of logic needs to exist. But it goes even further than that. E.g. if you declare two sets in a 'scrambled' state that contain items that can be sorted, not only are their representations sorted when you execute them, the union of two 'scrabled' sets returns a 'sorted' set as well:
a = {2, 3, 1}
a
# >>> {1, 2, 3}
b = {7, 4, 5}
b
# >>> {4, 5, 7}
a|b
# >>> {1, 2, 3, 4, 5, 7}
b|a
# >>> {1, 2, 3, 4, 5, 7}
Also when you add
a new item to a set and print the set, the new item appears in the correct place, i.e. the place where it should be if the set were sorted:
b.add(6)
b
# >>> {4, 5, 6, 7}
This brings me to my question. If you convert the sets to lists, something has to determine at what position each item of the set is placed in the new list. But it is appearently NOT the same logic that determines in which order the items are printed when executing a set, which is what I naively thought. While list(a)
, list(b)
and even list(a|b)
all return lists that are sorted the way the sets are represented, for the following set (and all its permutations by the way), this is for some reason not the case:
list(a), list(b), list(a|b)
# >>> ([1, 2, 3], [4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7])
c = {3, 4, 9} # or {3, 9, 4}, {4, 3, 9} and so on...
c
# >>> {3, 4, 9}
list(c)
# >>> [9, 3, 4]
Why is that? Why is the logic that determines in which way the representation of a set is ordered different from the one that determines where each item of a set goes when the set is converted to a list?
I tried a few more sets with different values, and to me, it seems completely random for when the order of the representation of the set and the order of the list of the set are the same:
# for this set of numbers, the logic is different
d = {3, 4, 11}
d
# >>> {3, 4, 11}
list(d)
# >>> [11, 3, 4]
# in this case, permutations also result in different sorting of the list
d = {11, 4, 3}
d
# >>> {3, 4, 11}
list(d)
# >>> [3, 11, 4]
# for this set, the logic appears to be the same again
e = {3, 4, 13} # or any of its permutations
e
# >>> {3, 4, 13}
list(e)
# >>> [3, 4, 13]
The logic that determines the order of the list and calling print(set)
seems to be the same:
list(d)
# >>> [3, 11, 4]
print(d)
# >>> {3, 11, 4}
So I guess as soon as you do something with the set, a different sorting logic is applied. Unless of course, you create the union:
print(c, d, c|d, list(c|d))
# >>> {9, 3, 4} {3, 11, 4} {3, 4, 9, 11} [3, 4, 9, 11]
f = {3, 4, 9, 11}
f
# >>> {3, 4, 9, 11}
list(f)
# >>> [11, 9, 3, 4]
If you are wondering about the use case: as I said, I naively thought that the sorting would stay the same when converting a set to a list, when in reality, it does not. The wrong sorting caused an error when running my code. Luckily, it is easy to fix by using sorted(set)
instead of list(set)
, but it took some time to figure out the mistake in the first place.
So with this question I am trying to understand what is going on rather than looking for a solution.
Solution 1:[1]
I'm on Python 3.7.4.
and all my list(set)
order coincides with repr(set)
order. Here's a quick test (code) on 10000 samples:
import random
import pandas as pd
# create a function to generate random set of 0-999 with len of 3 - 20
f = lambda: set(random.randrange(1000) for i in range(random.randrange(3, 21)))
# create a DataFrame of 10000 rows with random sets
df = pd.DataFrame({'sets': [f() for i in range(10000)]})
# Create a column of repr(set) and retrieve the order in str
df['reprs'] = df['sets'].apply(repr).str.strip('{}')
# Create a column of list(set) and retrieve the order in str
df['lists'] = df['sets'].apply(list).astype(str).str.strip('[]')
# Create a comparison column
df['match'] = df['reprs'].eq(df['lists'])
# Take a look of the reprs and lists...
print(df[['reprs', 'lists']])
# Summarize
summary = df.groupby('match')['sets'].count()
print(summary)
Results:
match True 10000 Name: sets, dtype: int64
So I'd imagine if anything you want to pay attention to how set
is represented, which is an implementation detail per the initial comment.
Solution 2:[2]
I believe what OP observes is the effect of interned values.
Consider:
>>> list({*range(2,99), True, False})
[False, True, 2, 3, 4, 5, 6, 7, 8, 9, ...]
However:
>>> list({*range(2002,2099), True, False})
[2048, 2049, ..., True, False, ..., 2047]
Yet it doesn't fully explain OP's own observation:
>>> list({3,4,9, True, False})
[False, True, 3, 4, 9]
>>> list({3,4,9})
[9, 3, 4]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | r.ook |
Solution 2 | Dima Tisnek |