Python: create all combinations of data points and filter them based on a function
I have a table of locations (currently in a pandas DataFrame) and want to calculate all combinations of rows and the distance between each pair.
Input:
ID | Lat | Lon |
---|---|---|
1 | 6,4355 | 53,2245 |
2 | 5,3434 | 50,2345 |
3 | 4,3434 | 51,2345 |
Desired Outcome:
ID1 | ID2 | distance |
---|---|---|
1 | 1 | 0 |
1 | 2 | 1 |
1 | 3 | 2 |
2 | 1 | 0 |
2 | 2 | 3 |
2 | 3 | 4 |
3 | 1 | 0 |
3 | 2 | 5 |
3 | 3 | 6 |
from math import radians, sin, cos, atan2, sqrt

def distance(lat1, lon1, lat2, lon2):
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    R = 6373.0  # approximate Earth radius in km
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return round(R * c)
Right now I loop through the DataFrame twice in such an ugly way that I'm not even going to show it, but it works. The problem is that it gets terribly slow as the table grows, and I know there must be a faster way to do this.
Ideally I can do this in standard Python/pandas/NumPy (as long as it's fast and I don't have to use obscure packages!). Any help would be much appreciated. Oh, and I want to filter on distance < 10 km, forgot to add that!
Here is my current code that I want to improve:
import pandas

df_distance = pandas.DataFrame(columns=['ID1', 'ID2', 'distance'])

""" first all IDs with themselves """
for index, row in df.iterrows():
    df_new_row = pandas.DataFrame([{'ID1': row['ID'], 'ID2': row['ID'],
                                    'distance': 0, 'lat1': row['Lat'], 'lon1': row['Lon'],
                                    'lat2': row['Lat'], 'lon2': row['Lon']}])
    df_distance = pandas.concat([df_distance, df_new_row])

for index1, row1 in df.iterrows():
    for index2, row2 in df.iterrows():
        if index2 > index1:
            dist = distance(row1['Lat'], row1['Lon'], row2['Lat'], row2['Lon'])
            if dist <= 10:  # filter at lower than 10 km
                """ add both directions """
                df_new_row = pandas.DataFrame([{'ID1': row1['ID'], 'ID2': row2['ID'],
                                                'distance': dist, 'lat1': row1['Lat'], 'lon1': row1['Lon'],
                                                'lat2': row2['Lat'], 'lon2': row2['Lon']},
                                               {'ID1': row2['ID'], 'ID2': row1['ID'],
                                                'distance': dist, 'lat1': row2['Lat'], 'lon1': row2['Lon'],
                                                'lat2': row1['Lat'], 'lon2': row1['Lon']}])
                df_distance = pandas.concat([df_distance, df_new_row])
Solution 1:[1]
Generally, use itertools.combinations from the standard library.
Example of what it produces:
>>> from itertools import combinations
>>> a = [1, 2, 3, 4, 5]
>>> for c1, c2 in combinations(a, 2):
... print(c1, c2)
...
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
Of course, you can use keys, list comprehensions etc. to pull the correct values out of your input, but programming is still about solving puzzles, so you now have everything you need :)
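A minimal sketch of how this could look for the question's data (it reuses the distance helper and DataFrame from the question and applies the 10 km cutoff; variable names are illustrative):

from itertools import combinations

import pandas as pd

# (ID, Lat, Lon) tuples, so we only walk the DataFrame once
points = list(df[['ID', 'Lat', 'Lon']].itertuples(index=False, name=None))

rows = []
for (id1, lat1, lon1), (id2, lat2, lon2) in combinations(points, 2):
    dist = distance(lat1, lon1, lat2, lon2)  # haversine helper from the question
    if dist <= 10:  # keep only pairs within 10 km
        rows.append({'ID1': id1, 'ID2': id2, 'distance': dist})
        rows.append({'ID1': id2, 'ID2': id1, 'distance': dist})  # both directions

df_distance = pd.DataFrame(rows, columns=['ID1', 'ID2', 'distance'])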
A small off-topic warning:
Calculating all combinations of full routes is O(n!), which basically means that with more than roughly 30 points (depending on your computer) you can forget about computing it in your lifetime. Computing each pair, however, is only O(n²), so it should be fine depending on how many points you have :)
@Edit: Generally you won't reduce the O(n²) complexity, but building a two-dimensional NumPy matrix and computing the distances across that structure will speed things up a lot, because NumPy pushes contiguous slices of data into the processor cache, which is the bottleneck in regular iterative code. Bear in mind that if the data exceeds your RAM it will be slow anyway, so for large inputs keep the data you calculate on as compact as possible and don't hold anything unnecessary.
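A sketch of that NumPy idea, assuming the Lat/Lon columns already hold floats in degrees; it computes the full n × n distance matrix with broadcasting instead of Python loops:

import numpy as np

def pairwise_haversine(lat, lon, radius_km=6373.0):
    """All-pairs great-circle distances in km for 1-D arrays of degrees."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    # broadcast (n, 1) against (n,) to get every pairwise difference at once
    dlat = lat[:, None] - lat
    dlon = lon[:, None] - lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:, None]) * np.cos(lat) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

dist_matrix = pairwise_haversine(df['Lat'], df['Lon'])
close_i, close_j = np.where(dist_matrix <= 10)  # index pairs within 10 km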
Another thing you might consider is the popular approach of doing this with more workers: split the data into chunks, process the chunks in parallel, and merge the results at the end.
Generally, you can find books (a quick Google search will do) about optimizing your code in these 'low-level' ways :)
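One hedged sketch of that split-and-merge idea, using only the standard library. It is process-based rather than thread-based, since pure-Python number crunching does not benefit from threads; the chunk size is arbitrary, distance and points are assumed to exist at module level as in the earlier sketch, and on Windows/macOS the pool code should sit under an if __name__ == '__main__' guard:

from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def distances_for_chunk(chunk):
    """chunk is a list of ((id1, lat1, lon1), (id2, lat2, lon2)) pairs."""
    out = []
    for (id1, lat1, lon1), (id2, lat2, lon2) in chunk:
        d = distance(lat1, lon1, lat2, lon2)
        if d <= 10:
            out.append((id1, id2, d))
    return out

pairs = list(combinations(points, 2))
chunks = [pairs[i:i + 10_000] for i in range(0, len(pairs), 10_000)]

with ProcessPoolExecutor() as pool:
    results = [row for part in pool.map(distances_for_chunk, chunks) for row in part]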
Solution 2:[2]
@jop I was unable to answer this on my own, so I reformulated your question and asked it separately; please check this solution:
Fastest way in numpy to get distance of product of n pairs in array
Filtering out every result above 10 km can be done efficiently with this code fragment:

import numpy as np

def filter_where(result, var=11):
    # keep only the distances strictly below the threshold
    return result[np.where(result < var)]
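For illustration, here is what that returns on a small made-up 3 × 3 distance matrix; note that it gives back a flat array of the qualifying values, not the pairs they came from:

>>> import numpy as np
>>> result = np.array([[0.0, 4.2, 25.1],
...                    [4.2, 0.0, 12.7],
...                    [25.1, 12.7, 0.0]])
>>> filter_where(result, var=11)
array([0. , 4.2, 4.2, 0. , 0. ])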
To load the data from a pandas DataFrame into NumPy you can follow these topics (a minimal sketch is shown after the links):
Selecting multiple columns in a Pandas dataframe
Convert Select Columns in Pandas Dataframe to Numpy Array
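In this case that boils down to something like the following (assuming the columns are named Lat and Lon as in the question and already contain floats):

coords = df[['Lat', 'Lon']].to_numpy()  # shape (n, 2): one [lat, lon] row per point
lats, lons = coords[:, 0], coords[:, 1]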
I hope this solves your performance issue; please let me know in the comments how solving the puzzle went :)
To solve the puzzle you only need to keep track of the correct indices, so you don't lose which point is combined with which.
PS: I believe this should be at least 60 times faster than your current solution.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Guaz |