'Python Pandas : how to combine trip segments into a journey with Transport smart card data

Currently working with an interesting transport smart card dataset. Each line in the current data represent a trip (e.g. bus trip from A to B). Any trips within 60 min needs to be grouped into journey.

The current table:

  CustomerID  SegmentID Origin Dest StartTime  EndTime  Fare    Type
0       A001        101      A    B    7:30am   7:45am   1.5     Bus
1       A001        102      B    C    7:50am   8:30am   3.5   Train
2       A001        103      C    B   17:10pm  18:00pm   3.5   Train
3       A001        104      B    A   18:10pm  18:30pm   1.5     Bus
4       A002        105      K    Y   11:30am  12:30pm   3.0   Train
5       A003        106      P    O   10:23am  11:13am   4.0  Ferrie

and covert into something like:

  CustomerID  JourneyID Origin Dest Start Time End Time  Fare        Type  NumTrips
0       A001          1      A    C     7:30am   8:30am     5  Intermodal         2
1       A001          2      C    A    17:10pm  18:30pm     5  Intermodal         2
2       A002          6      K    Y    11:30am  12:30pm     3       Train         1
3       A003          8      P    O    10:23am  11:13am     4      Ferrie         1

I'm new to Python and Pandas and have no idea how to start, so any guidance would be appreciated.



Solution 1:[1]

Here's a fairly complete answer. You didn't fully specify the concept of a single journey so I took a guess. You could adjust mask below to better suit your own definition.

# get rid of am/pm and convert to proper datetime
# converts to year 1900 b/c it's not specified, doesn't matter here
df['StTime'] = pd.to_datetime( df.StartTime.str[:-2], format='%H:%M' )
df['EndTime'] = pd.to_datetime( df.EndTime.str[:-2], format='%H:%M' )

# some of the later processing is easier if you use duration
# instead of arrival time
df['Duration'] = df.EndTime-df.StTime

# get rid of some nuisance variables for clarity
df = df[['CustomerID','Origin','Dest','StTime','Duration','Fare','Type']]

First, we need to figure out a way group the rows. As this is not well specified in the question, I'll group by Customer ID where Start Times are within 1 hr. Note that for tri-modal trips this actually implies that start times of the first and third trips could differ by more than one hour as long as first+second and second+third are each individaully under 1 hour. This seems like a natural way to do it, but for your actual use case you'd have to adjust this for your desired definition. There are quite a few ways you could proceed here.

mask1 = df.StTime - df.StTime.shift(1) <= pd.Timedelta('01:00:00')
mask2 = (df.CustomerID == df.CustomerID.shift(1))
mask = ( mask1 & mask2 )    

Now we can use the mask with cumsum to generate a tripID:

df['JourneyID'] = 1
df.ix[mask,'JourneyID'] = 0
df['JourneyID'] = df['JourneyID'].cumsum()
df['NumTrips'] = 1

df[['CustomerID','StTime','Fare','JourneyID']]

  CustomerID              StTime  Fare  JourneyID
0       A001 1900-01-01 07:30:00   1.5          1
1       A001 1900-01-01 07:50:00   3.5          1
2       A001 1900-01-01 17:10:00   3.5          2
3       A001 1900-01-01 18:10:00   1.5          2
4       A002 1900-01-01 11:30:00   3.0          3
5       A003 1900-01-01 10:23:00   4.0          4

Now, for each column just aggregate appropriately:

df2 = df.groupby('JourneyID').agg({ 'Origin' : sum, 'CustomerID' : min,
                                    'Dest'   : sum, 'StTime'     : min,
                                    'Fare'   : sum, 'Duration'   : sum,
                                    'Type'   : sum, 'NumTrips'   : sum })

                      StTime Dest Origin Fare Duration     Type CustomerID NumTrips
JourneyID                                                                               
1        1900-01-01 07:30:00   BC     AB    5 00:55:00 BusTrain       A001        2
2        1900-01-01 17:10:00   BA     CB    5 01:10:00 TrainBus       A001        2
3        1900-01-01 11:30:00    Y      K    3 01:00:00    Train       A002        1
4        1900-01-01 10:23:00    O      P    4 00:50:00   Ferrie       A003        1

Note that Duration includes only travel time and not the time in-between trips (e.g. if start time of second trip is later than the end time of first trip).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1