Grouping duplicates in CSV file and ranking data based on certain values
I have a CSV file like this:
"user_id","age","liked_ad","location"
2145,34,true,USA
6786,25,true,UK
9025,21,false,USA
1145,40,false,UK
The CSV file goes on. I worked out that the same user_id appears more than once in the file, so what I am trying to do is find out which users have the most 'true' values in the 'liked_ad' column. I am stuck on how to do this in Java and would appreciate any help.
This is what I have so far, which just parses the file:
    public static void main(String[] args) throws FileNotFoundException
    {
        Scanner scanner = new Scanner(new File("src/main/resources/advert-data.csv"));
        scanner.useDelimiter(",");
        while (scanner.hasNext()) {
            System.out.print(scanner.next() + " | ");
        }
        scanner.close();
    }
I'm not sure where to go from here to get the per-user counts.
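One detail worth checking in the snippet above: with "," as the only delimiter, a line break is not a separator, so the last field of one row and the first field of the next arrive as a single token. The following is a minimal sketch of that effect next to a line-by-line read; the class name DelimiterDemo and both helper methods are made up for illustration, and the input is an inline string rather than the real file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Tokenize the text the way the question's code does: comma as the only delimiter.
    static List<String> tokensByComma(String csv) {
        List<String> tokens = new ArrayList<>();
        try (Scanner scanner = new Scanner(csv)) {
            scanner.useDelimiter(",");
            while (scanner.hasNext()) {
                tokens.add(scanner.next());
            }
        }
        return tokens;
    }

    // Read one row at a time, then split each row on commas.
    static List<String[]> rowsByLine(String csv) {
        List<String[]> rows = new ArrayList<>();
        try (Scanner scanner = new Scanner(csv)) {
            while (scanner.hasNextLine()) {
                rows.add(scanner.nextLine().split(","));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String csv = "2145,34,true,USA\n6786,25,true,UK";
        // The fourth comma-delimited token spans the line break ("USA\n6786")
        System.out.println(tokensByComma(csv).get(3).contains("\n")); // prints true
        // Reading by line keeps the rows apart, so the second row starts with its user_id
        System.out.println(rowsByLine(csv).get(1)[0]); // prints 6786
    }
}
```

Reading whole lines first keeps each record together, which makes it straightforward to pick out the user_id and liked_ad columns per row.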
Solution 1:[1]
You can store the frequency of the true value of liked_ad for each user_id in a Map<String, Integer> and then sort the Map on its values.
    import java.io.File;
    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Scanner;

    public class Main {
        public static void main(String[] args) throws IOException {
            // Store the number of true liked_ad values for each user_id
            Map<String, Integer> map = new HashMap<>();
            // try-with-resources closes the Scanner (and the file) when done
            try (Scanner scanner = new Scanner(new File("file.txt"))) {
                // Ignore the header line
                if (scanner.hasNextLine()) {
                    scanner.nextLine();
                }
                while (scanner.hasNextLine()) {
                    String[] data = scanner.nextLine().split(",");
                    if (data.length >= 3 && Boolean.parseBoolean(data[2])) {
                        map.merge(data[0], 1, Integer::sum);
                    }
                }
            }
            // Sort the Map on values (highest first) and display each entry
            map.entrySet().stream().sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                    .forEach(System.out::println);
        }
    }
Given the following data in the file:
"user_id","age","liked_ad","location"
1145,40,true,UK
2145,34,true,USA
6786,25,true,UK
6786,25,true,UK
1145,40,true,UK
2145,34,true,USA
9025,21,false,USA
1145,40,false,UK
1145,40,true,UK
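the output will be as follows (the relative order of users with equal counts is not guaranteed, because HashMap iteration order is unspecified):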
1145=3
6786=2
2145=2
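If only the user or users with the highest count are needed, rather than the full ranking, the counts can be scanned once more for the maximum. This is a minimal sketch under that assumption; the class name TopLikers and the method topUsers are hypothetical, and the map is filled by hand with the counts from the sample output above instead of being read from the file.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopLikers {
    // Return every user_id whose count equals the highest count, sorted by id.
    static List<String> topUsers(Map<String, Integer> counts) {
        if (counts.isEmpty()) {
            return Collections.emptyList();
        }
        int max = Collections.max(counts.values());
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == max) {
                top.add(entry.getKey());
            }
        }
        // Sorting makes the result independent of HashMap iteration order
        Collections.sort(top);
        return top;
    }

    public static void main(String[] args) {
        // Counts taken from the sample output above
        Map<String, Integer> counts = new HashMap<>();
        counts.put("1145", 3);
        counts.put("6786", 2);
        counts.put("2145", 2);
        System.out.println(topUsers(counts)); // prints [1145]
    }
}
```

Returning a list rather than a single id means that users tied for first place are all reported.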
Solution 2:[2]
The following code should do what you want to achieve:
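    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class Main {
        public static void main(String[] args) throws IOException {
            Map<String, Integer> stats = new HashMap<>();
            // Count the rows with liked_ad == true for each user_id
            Files.readAllLines(Paths.get(args[0])).forEach((line) -> {
                String[] columns = line.split(",");
                // The header row is skipped implicitly: "liked_ad" does not parse as true
                if (columns.length >= 3 && Boolean.parseBoolean(columns[2])) {
                    stats.compute(columns[0], (key, value) -> value == null ? 1 : value + 1);
                }
            });
            // A TreeMap orders by key (user_id), so sort the entries by count instead
            stats.entrySet().stream()
                    .sorted(Collections.reverseOrder(Map.Entry.comparingByValue()))
                    .forEach(entry -> System.out.println(entry.getKey() + ": " + entry.getValue()));
        }
    }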
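The same per-user counting can also be written with stream collectors, which some readers find easier to follow. The sketch below is one possible variant and not part of the original answer; the class name LikeCounter and the method countLikes are hypothetical, and the rows are passed in as a list (for a real file they could come from Files.readAllLines).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class LikeCounter {
    // Count rows whose third column is "true", keyed by the first column (user_id).
    static Map<String, Long> countLikes(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))
                .filter(columns -> columns.length >= 3 && Boolean.parseBoolean(columns[2]))
                .collect(Collectors.groupingBy(columns -> columns[0], Collectors.counting()));
    }

    public static void main(String[] args) {
        // Inline sample rows, header included; the header never parses as true
        List<String> lines = Arrays.asList(
                "\"user_id\",\"age\",\"liked_ad\",\"location\"",
                "2145,34,true,USA",
                "6786,25,true,UK",
                "2145,34,true,USA",
                "9025,21,false,USA");
        Map<String, Long> counts = countLikes(lines);
        // Print users from most to fewest likes
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .forEach(entry -> System.out.println(entry.getKey() + ": " + entry.getValue()));
    }
}
```

Users with no true rows never enter the map, so no separate filter for a zero count is needed.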
Solution 3:[3]
Read the CSV file, group it by user_id, count the records in each group whose third column is true, keep the groups where the count is greater than 0, and then sort the groups by count in descending order. Coding this process by hand in Java takes a fair amount of code.
I suggest using SPL, an open-source Java package, for this. A single line of code is enough:
A | |
---|---|
1 | =file("advert-data.csv").import@cqt().groups(user_id;count(#3==true):count).select(#2>0).sort(-#2) |
SPL offers a JDBC driver that can be invoked from Java. Store the above SPL script as rank.splx and invoke it from Java the way you would call a stored procedure:
…
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
st = con.prepareCall("call rank()");
st.execute();
…
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Basil Bourque |
Solution 2 | rmunge |
Solution 3 |