How can I check if a block of multiple lines matches certain criteria, without loops?

I have a data set with 2 million lines, so explicit loops are not an option. The problem is roughly as follows:

  • Each line is a transaction by a person.
  • A person can have multiple transactions.
  • A transaction can have multiple lines.
  • Transactions can have different types.

I need to check whether the order of transaction types for each person is realistic. For example, you can't close an account before you have opened it.

So basically:

PersonID TransID TransType
----------------------------
1        1       open
1        2       withdraw
1        2       withdraw
1        3       close
2        1       withdraw
2        1       withdraw
2        2       close    

Now Person 2 withdrew and closed an account without opening it. That's an error. So I want the index of the last line of person 2.

There are tons of rules about what can go before what, and under what circumstances, so what I need is a way to code something like:

FROM INDEX a TO b, CHECK IF x OCCURS BEFORE y THEN REPEAT FROM INDEX b+1 TO c UNTIL WE ARE THROUGH THE ENTIRE DATASET

The exact form of the result is not that important: either the IDs of the people or a vector of the rows where a rule has been violated would be fine.

Any ideas?



Solution 1:[1]

I suggest you break this down into multiple steps: break out one data frame for each person, then one for each transaction.

Then apply your rules appropriately. You will probably have some person-level rules and some transaction-level rules.

Here is some code to start you off.

    # Sample data matching the table in the question
    data <- data.frame(PersonID = c(1, 1, 1, 1, 2, 2, 2),
                       TransID = c(1, 2, 2, 3, 1, 1, 2),
                       TransType = c("open",
                                     "withdraw",
                                     "withdraw",
                                     "close",
                                     "withdraw",
                                     "withdraw",
                                     "close"))

    # split() dispatches to split.data.frame for a data frame
    result <- split(data, data$PersonID)

    result

This returns a list with one data frame per person, named by PersonID.
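
As a rough sketch of the next step, a person-level rule such as "open must come before close" could be checked on each group like this. The helper `occurs_before` and its arguments `earlier` and `later` are hypothetical names made up for illustration, and the rule is only one possible reading of "x occurs before y" (it compares first occurrences only). Note that `vapply` still iterates over the groups internally; it just avoids an explicit row-by-row loop.

```r
# Sample data from the question (person 2 never opens an account).
data <- data.frame(
  PersonID  = c(1, 1, 1, 1, 2, 2, 2),
  TransID   = c(1, 2, 2, 3, 1, 1, 2),
  TransType = c("open", "withdraw", "withdraw", "close",
                "withdraw", "withdraw", "close"),
  stringsAsFactors = FALSE
)

# Hypothetical rule helper (not from the original answer):
# TRUE if `later` never occurs, or if the first `earlier` comes
# before the first `later`; FALSE otherwise.
occurs_before <- function(types, earlier, later) {
  first_earlier <- match(earlier, types)  # NA if absent
  first_later   <- match(later, types)    # NA if absent
  is.na(first_later) ||
    (!is.na(first_earlier) && first_earlier < first_later)
}

# Split the TransType column by person and apply the rule to each group.
types_by_person <- split(data$TransType, data$PersonID)
ok <- vapply(types_by_person, occurs_before, logical(1),
             earlier = "open", later = "close")

# IDs of the people who violate the rule.
bad_ids <- names(ok)[!ok]

# Row index of the last line of each violating person.
rows_by_person <- split(seq_len(nrow(data)), data$PersonID)
bad_rows <- vapply(rows_by_person[bad_ids], max, integer(1))
```

Here `bad_ids` holds the offending PersonIDs as character strings and `bad_rows` holds the last row index for each of them, which covers both result formats mentioned in the question. Further rules can be added by calling the helper with other pairs of transaction types, or by writing similar helpers that take one person's rows as input.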

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1