Deleting multiple rows under same App Name but with different number of reviews
I have a dataframe with many columns, two of them being 'App' and 'Reviews'. I discovered that the same app can have multiple rows that differ only in the number of reviews. Naturally one has to go with the row having the highest number of reviews, assuming it to be the latest one. For example:
Many apps have multiple rows like this, so it is not practical to edit them manually. First I found out how many times each app occurs using the value_counts() function, and converted the result into a dictionary in which the app name is the key and its count is the corresponding value. For example:
{'ROBLOX': 9, '8 Ball Pool': 7, 'Bubble Shooter': 6, 'Helix Jump': 6}
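For illustration, a minimal sketch of that counting step. The sample rows are made up; only the 'App' and 'Reviews' column names come from the description above.

```python
import pandas as pd

# Hypothetical sample data, not the real dataset.
df = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Helix Jump"],
    "Reviews": [100, 250, 40],
})

# value_counts() counts the rows per app; to_dict() maps app name -> count.
app_counts = df["App"].value_counts().to_dict()
print(app_counts)
```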
Then I wrote the following nested for loop to check each app and keep only the observation with the highest number of reviews.
It gives me an error for this line--> if temp_df.iloc[temp_indices]['Reviews'] != max_review:
saying: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Solution 1:[1]
You don't need to create a dictionary and loop over it; that is a bit circuitous.
Here are 3 ways you can solve this. The 1st and 2nd solutions will leave you with exactly one row for each App, while the 3rd solution keeps multiple rows if the max value occurs more than once.
(1)
df.loc[df.groupby('App')['Reviews'].idxmax(), :]
(2)
df.sort_values(by=['App', 'Reviews'], ascending=[True, False]).drop_duplicates('App', keep='first')
(3)
df.loc[df['Reviews'] == df.groupby('App')['Reviews'].transform('max')]
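As a rough sketch, here are the three approaches run side by side on a small made-up dataframe. The sample rows are hypothetical, and the sketch assumes 'Reviews' is already numeric; if it is stored as strings, converting it first (for example with pd.to_numeric) would be needed for the maximum to be meaningful.

```python
import pandas as pd

# Hypothetical sample data; ROBLOX has a tie at its maximum on purpose.
df = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "ROBLOX", "Helix Jump", "Helix Jump", "Solo App"],
    "Reviews": [100, 250, 250, 40, 30, 5],
})

# (1) idxmax gives the index label of the first maximum in each group.
by_idxmax = df.loc[df.groupby("App")["Reviews"].idxmax()]

# (2) Sort so the largest count comes first within each app, then keep that row.
by_sort = (
    df.sort_values(by=["App", "Reviews"], ascending=[True, False])
      .drop_duplicates("App", keep="first")
)

# (3) Keep every row whose count equals its group's maximum, so ties survive.
by_transform = df.loc[df["Reviews"] == df.groupby("App")["Reviews"].transform("max")]

print(len(by_idxmax), len(by_sort), len(by_transform))  # 3 3 4
```

The extra row in (3) is the second ROBLOX row with 250 reviews, which shows the difference between keeping one row per app and keeping all rows tied for the maximum.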
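About your error: temp_df.iloc[temp_indices]['Reviews'] is a Series, so comparing it with max_review gives a Series of True/False values rather than a single one. An if statement needs a single truth value, which is why pandas raises the ValueError.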
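A small sketch that reproduces the error, assuming temp_indices is a list of positions (the original loop is not shown, so temp_df and the positions here are made up):

```python
import pandas as pd

# Hypothetical stand-in for the question's temp_df.
temp_df = pd.DataFrame({"App": ["ROBLOX", "ROBLOX"], "Reviews": [100, 250]})
max_review = temp_df["Reviews"].max()

# Selecting with a list of positions yields a Series, so != yields a boolean Series.
mask = temp_df.iloc[[0, 1]]["Reviews"] != max_review

try:
    if mask:  # a multi-element Series has no single truth value
        pass
except ValueError as err:
    message = str(err)
    print(message)

# Comparing one scalar at a time works, as does reducing the mask explicitly.
is_first_below_max = temp_df.iloc[0]["Reviews"] != max_review
any_below_max = mask.any()
```

If a loop is really wanted, the comparison has to be made on one value at a time or reduced with .any() or .all(), as the error message suggests.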
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Rabinzel |