Parse pcap file with scapy

I am comparing scapy and dpkt in terms of speed. I have a directory of pcap files; I parse each one and count the HTTP requests in it. Here's the scapy code:

import os
import time
from scapy.all import *


def parse(f):
    x = 0
    pcap = rdpcap(f)  # rdpcap loads the entire capture into memory
    for p in pcap:
        try:
            if p.haslayer(TCP) and p.getlayer(TCP).dport == 80 and p.haslayer(Raw):
                x = x + 1
        except:
            continue
    print x


if __name__ == '__main__':
    path = '/home/pcaps'
    start = time.time()
    for file in os.listdir(path):
        current = os.path.join(path, file)
        print current
        f = open(current, 'rb')
        parse(f)
        f.close()
    end = time.time()
    print (end - start)

The script is really slow (it gets stuck after a few minutes) compared to the dpkt version:

import dpkt
import time
import os


def parse(f):
    x = 0
    try:
        pcap = dpkt.pcap.Reader(f)
    except:
        print "Invalid Header"
        return
    for ts, buf in pcap:
        try:
            eth = dpkt.ethernet.Ethernet(buf)
        except:
            continue
        if eth.type != 2048:  # 2048 (0x0800) is the IPv4 EtherType
            continue
        try:
            ip = eth.data
        except:
            continue
        if ip.p == 6:  # 6 is the TCP protocol number
            if type(eth.data) == dpkt.ip.IP:
                tcp = ip.data
                if tcp.dport == 80:
                    try:
                        http = dpkt.http.Request(tcp.data)
                        x = x + 1
                    except:
                        continue
    print x


if __name__ == '__main__':
    path = '/home/pcaps'
    start = time.time()
    for file in os.listdir(path):
        current = os.path.join(path, file)
        print current
        f = open(current, 'rb')
        parse(f)
        f.close()
    end = time.time()
    print (end - start)

So is there something wrong with the way I am using scapy? Or is it just that scapy is slower than dpkt?
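For reference, a streaming scapy variant of the same count is sketched below. It assumes scapy's PcapReader, which iterates packets one at a time instead of loading the whole capture into memory the way rdpcap does (this is also the call timed in the answer below); the path is a placeholder.

from scapy.all import PcapReader, TCP, Raw

# Minimal sketch: stream packets one at a time and count those that
# look like HTTP requests to port 80.
x = 0
for p in PcapReader('/home/pcaps/example.pcap'):  # placeholder path
    if p.haslayer(TCP) and p[TCP].dport == 80 and p.haslayer(Raw):
        x += 1
print(x)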



Solution 1:[1]

You inspired me to compare. 2 GB PCAP. Dumb test. Simply counting the number of packets.

I'd expect this to take single-digit minutes with C++ / libpcap, based on previous timings of similar-sized files. But this is something new, and I wanted to prototype first; my velocity is generally higher in Python.

For my application, streaming is the only option: I'll be reading several of these PCAPs simultaneously and doing computations based on their contents, so I can't just hold them in memory. For that reason I'm only comparing streaming calls.
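The multi-capture pattern described above could look roughly like the sketch below (the paths are placeholders and the per-packet work is a stub); it is not part of the timing comparison.

import heapq
import dpkt

def stream(path):
    # Yield (timestamp, raw bytes) from one capture without holding it in memory.
    with open(path, 'rb') as f:
        for ts, buf in dpkt.pcap.Reader(f):
            yield ts, buf

# heapq.merge keeps the combined stream ordered by timestamp while still
# reading each file lazily.
merged = heapq.merge(stream('/pcaps/a.pcap'), stream('/pcaps/b.pcap'),
                     key=lambda item: item[0])
for ts, buf in merged:
    pass  # per-packet computation goes here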

scapy 2.4.5:

from scapy.all import *
import datetime

i=0
print(datetime.datetime.now())
for packet in PcapReader("/my.pcap"):
    i+=1
else:
    print(i)
    print(datetime.datetime.now())

dpkt 1.9.7.2:

import datetime
import dpkt

pcap_file = "/my.pcap"  # same file as the scapy test
print(datetime.datetime.now())
with open(pcap_file, 'rb') as f:
    pcap = dpkt.pcap.Reader(f)
    i = 0
    for timestamp, buf in pcap:
        i += 1
    else:
        print(i)
        print(datetime.datetime.now())

Results:

Packet count is the same. So that's good. :-)

dpkt: just under 10 minutes.

scapy: 35 minutes.

dpkt went first, so if the disk cache were helping either package, it would be scapy. And I think it might be, marginally: I did this previously with scapy only, and it was over 40 minutes.

In summary, thanks for your 5-year-old question; it's still relevant today. I almost bailed on Python here because of scapy's overly long read times. dpkt seems substantially more performant.
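A scapy-side aside I have not timed here: if all you need is raw bytes and timestamps, scapy's RawPcapReader skips building packet layers, which is likely a large part of the overhead measured above. A rough sketch, reusing the same placeholder path:

from scapy.all import RawPcapReader
import datetime

# Rough sketch: count packets without dissecting them.
# RawPcapReader yields (raw bytes, metadata) tuples, so no scapy
# layers are built per packet.
i = 0
print(datetime.datetime.now())
for pkt_data, pkt_metadata in RawPcapReader("/my.pcap"):
    i += 1
print(i)
print(datetime.datetime.now())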

Side note on alternative packages:

https://pypi.org/project/python-libpcap/: I'm on Python 3.10, and 0.4.0 seems broken for me, unfortunately.

https://pypi.org/project/libpcap/: I'd like to compare timings with this one, but I have found it much harder to get a minimal example going. I haven't spent much time on it, though, to be fair.

Sources

[1] Solution 1 by Evan, Stack Overflow. This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.