'Python Pandas : how to combine trip segments into a journey with Transport smart card data
Currently working with an interesting transport smart card dataset. Each line in the current data represent a trip (e.g. bus trip from A to B). Any trips within 60 min needs to be grouped into journey.
The current table:
CustomerID SegmentID Origin Dest StartTime EndTime Fare Type
0 A001 101 A B 7:30am 7:45am 1.5 Bus
1 A001 102 B C 7:50am 8:30am 3.5 Train
2 A001 103 C B 17:10pm 18:00pm 3.5 Train
3 A001 104 B A 18:10pm 18:30pm 1.5 Bus
4 A002 105 K Y 11:30am 12:30pm 3.0 Train
5 A003 106 P O 10:23am 11:13am 4.0 Ferrie
and covert into something like:
CustomerID JourneyID Origin Dest Start Time End Time Fare Type NumTrips
0 A001 1 A C 7:30am 8:30am 5 Intermodal 2
1 A001 2 C A 17:10pm 18:30pm 5 Intermodal 2
2 A002 6 K Y 11:30am 12:30pm 3 Train 1
3 A003 8 P O 10:23am 11:13am 4 Ferrie 1
I'm new to Python and Pandas and have no idea how to start, so any guidance would be appreciated.
Solution 1:[1]
Here's a fairly complete answer. You didn't fully specify the concept of a single journey so I took a guess. You could adjust mask below to better suit your own definition.
# get rid of am/pm and convert to proper datetime
# converts to year 1900 b/c it's not specified, doesn't matter here
df['StTime'] = pd.to_datetime( df.StartTime.str[:-2], format='%H:%M' )
df['EndTime'] = pd.to_datetime( df.EndTime.str[:-2], format='%H:%M' )
# some of the later processing is easier if you use duration
# instead of arrival time
df['Duration'] = df.EndTime-df.StTime
# get rid of some nuisance variables for clarity
df = df[['CustomerID','Origin','Dest','StTime','Duration','Fare','Type']]
First, we need to figure out a way group the rows. As this is not well specified in the question, I'll group by Customer ID where Start Times are within 1 hr. Note that for tri-modal trips this actually implies that start times of the first and third trips could differ by more than one hour as long as first+second and second+third are each individaully under 1 hour. This seems like a natural way to do it, but for your actual use case you'd have to adjust this for your desired definition. There are quite a few ways you could proceed here.
mask1 = df.StTime - df.StTime.shift(1) <= pd.Timedelta('01:00:00')
mask2 = (df.CustomerID == df.CustomerID.shift(1))
mask = ( mask1 & mask2 )
Now we can use the mask with cumsum to generate a tripID:
df['JourneyID'] = 1
df.ix[mask,'JourneyID'] = 0
df['JourneyID'] = df['JourneyID'].cumsum()
df['NumTrips'] = 1
df[['CustomerID','StTime','Fare','JourneyID']]
CustomerID StTime Fare JourneyID
0 A001 1900-01-01 07:30:00 1.5 1
1 A001 1900-01-01 07:50:00 3.5 1
2 A001 1900-01-01 17:10:00 3.5 2
3 A001 1900-01-01 18:10:00 1.5 2
4 A002 1900-01-01 11:30:00 3.0 3
5 A003 1900-01-01 10:23:00 4.0 4
Now, for each column just aggregate appropriately:
df2 = df.groupby('JourneyID').agg({ 'Origin' : sum, 'CustomerID' : min,
'Dest' : sum, 'StTime' : min,
'Fare' : sum, 'Duration' : sum,
'Type' : sum, 'NumTrips' : sum })
StTime Dest Origin Fare Duration Type CustomerID NumTrips
JourneyID
1 1900-01-01 07:30:00 BC AB 5 00:55:00 BusTrain A001 2
2 1900-01-01 17:10:00 BA CB 5 01:10:00 TrainBus A001 2
3 1900-01-01 11:30:00 Y K 3 01:00:00 Train A002 1
4 1900-01-01 10:23:00 O P 4 00:50:00 Ferrie A003 1
Note that Duration includes only travel time and not the time in-between trips (e.g. if start time of second trip is later than the end time of first trip).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 |