Problem decoding h264 over RTP TCP stream

I'm trying to receive an RTP stream carrying h264 over TCP from my Hikvision DS-KH8350-WTE1 intercom. By reverse engineering, I was able to replicate how the original Hikvision software (Hik-Connect on iPhone and iVMS-4200 on macOS) connects and negotiates streaming. I'm now getting the very same stream as the original apps, verified through Wireshark, and I need to make sense of it. I know it's RTP because I inspected how iVMS-4200 handles it using /usr/bin/sample on macOS, which yields:

  ! :               2 CStreamConvert::InputData(void*, int)  (in libOpenNetStream.dylib) + 52  [0x11ff7c7a6]
+     ! :                 2 SYSTRANS_InputData  (in libSystemTransform.dylib) + 153  [0x114f917f2]
+     ! :                   1 CRTPDemux::ProcessH264(unsigned char*, unsigned int, unsigned int, unsigned int)  (in libSystemTransform.dylib) + 302  [0x114fa2c04]
+     ! :                   | 1 CRTPDemux::AddAVCStartCode()  (in libSystemTransform.dylib) + 47  [0x114fa40f1]
+     ! :                   1 CRTPDemux::ProcessH264(unsigned char*, unsigned int, unsigned int, unsigned int)  (in libSystemTransform.dylib) + 476  [0x114fa2cb2]
+     ! :                     1 CRTPDemux::ProcessVideoFrame(unsigned char*, unsigned int, unsigned int)  (in libSystemTransform.dylib) + 1339  [0x114fa29b3]
+     ! :                       1 CMPEG2PSPack::InputData(unsigned char*, unsigned int, FRAME_INFO*)  (in libSystemTransform.dylib) + 228  [0x114f961d6]
+     ! :                         1 CMPEG2PSPack::PackH264Frame(unsigned char*, unsigned int, FRAME_INFO*)  (in libSystemTransform.dylib) + 238  [0x114f972fe]
+     ! :                           1 CMPEG2PSPack::FindAVCStartCode(unsigned char*, unsigned int)  (in libSystemTransform.dylib) + 23  [0x114f97561]

I can break on these calls with lldb and see that the arriving packet data matches the format I describe below.

The packet signatures look as follows:

0x24 0x02 0x05 0x85 0x80 0x60 0x01 0x57 0x00 0x00 0x00 0x02 0x00 0x00 0x27 0xde 0x0d 0x80 0x60 0x37 0x94 0x71 0xe3 0x97 0x10 0x77 0x20 0x2c 0x51 | 0x7c 0x85 0xb8 0x00 00 00 00 01 65 0xb8 0x0 0x0 0xa 0x35 ...

0x24 0x02 0x05 0x85 0x80 0x60 0x01 0x58 0x00 0x00 0x00 0x02 0x00 0x00 0x27 0xde 0x0d 0x80 0x60 0x37 0x95 0x71 0xe3 0x97 0x10 0x77 0x20 0x2c 0x51 | 0x7c 0x05 0x15 0xac ...

0x24 0x02 0x05 0x85 0x80 0x60 0x01 0x59 0x00 0x00 0x00 0x02 0x00 0x00 0x27 0xde 0x0d 0x80 0x60 0x37 0x96 0x71 0xe3 0x97 0x10 0x77 0x20 0x2c 0x51 | 0x7c 0x05 0x5d 0x00 ...
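The 29 bytes before the `|` can be split into fields. The sketch below is only one possible reading and is not confirmed for Hikvision devices: it assumes the first four bytes are RTSP-style interleaved framing (`$`, channel, 16-bit length), followed by a standard 12-byte RTP header, one byte of unknown purpose, and a second 12-byte RTP-like header. The function name `split_header` and the field names are made up for illustration.

```python
import struct

def split_header(packet: bytes) -> dict:
    """Split the 29 leading bytes into fields (the layout is an assumption)."""
    if len(packet) < 29:
        raise ValueError("packet shorter than the 29-byte header")
    # Assumed interleaved framing: '$', channel id, big-endian payload length.
    magic, channel, length = struct.unpack_from(">BBH", packet, 0)
    if magic != 0x24:
        raise ValueError("packet does not start with 0x24 ('$')")
    # Assumed fixed RTP header: flags, marker/payload type, sequence, timestamp, SSRC.
    _, mpt, outer_seq, outer_ts, outer_ssrc = struct.unpack_from(">BBHII", packet, 4)
    unknown = packet[16]  # 0x0d in every dump above; meaning not known
    _, _, inner_seq, inner_ts, inner_ssrc = struct.unpack_from(">BBHII", packet, 17)
    return {
        "channel": channel,
        "length": length,
        "payload_type": mpt & 0x7F,
        "outer_seq": outer_seq,
        "inner_seq": inner_seq,
        "unknown": unknown,
        "payload": packet[29:],  # starts at 0x7c in the dumps above
    }

# First dump from the question, truncated after the 0x7c 0x85 marker.
sample = bytes.fromhex(
    "24020585" "80600157" "00000002" "000027de"
    "0d" "80603794" "71e39710" "77202c51" "7c85"
)
fields = split_header(sample)
print(fields["outer_seq"], fields["inner_seq"], fields["payload"].hex())
```

Under this reading, the bytes 0x57/0x58/0x59 and 0x94/0x95/0x96 are the low bytes of two 16-bit sequence numbers.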

By reverse engineering the original software I was able to figure out that 0x7c85 indicates a key frame. In the genuine software's processing, the 0x7c85 bytes get replaced by the h264 key-frame NALU header 00 00 00 01 65. That's the h264 Annex B format. The 0x7c05 packets always follow and carry the remaining payload of the key frame. No NALU header is added while handling them (the 0x7c05 is stripped away and the rest of the bytes are copied). None of the bytes preceding 0x7cXX make it into an mp4 recording, which makes sense as they belong to the RTP protocol, although I'm not sure whether it's entirely standard RTP or whether Hikvision added something custom.
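The replacement rule described here matches what RFC 6184 calls an FU-A fragmentation unit: 0x7c has NAL type 28, and the second byte carries a start bit (0x80), an end bit (0x40) and the original NAL type in its low five bits. Combining the top three bits of 0x7c with the low five bits of 0x85 gives 0x65, and with 0x81 gives 0x61. Whether the Hikvision stream follows the RFC in every detail is an assumption; the function name `fu_a_to_annexb` is hypothetical, and the sketch only covers this one payload type.

```python
START_CODE = b"\x00\x00\x00\x01"

def fu_a_to_annexb(payload: bytes) -> bytes:
    """Convert one FU-A payload into Annex B bytes.

    Assumes `payload` starts at the FU indicator (0x7c in the dumps above),
    i.e. all preceding header bytes have already been removed.
    """
    if len(payload) < 2:
        raise ValueError("payload too short for an FU-A header")
    indicator, fu_header = payload[0], payload[1]
    if indicator & 0x1F != 28:
        raise ValueError("not an FU-A payload")
    if fu_header & 0x80:
        # Start bit set: first fragment of a NAL unit. Rebuild the NAL header
        # from the indicator's F/NRI bits and the FU header's type bits.
        nal_header = (indicator & 0xE0) | (fu_header & 0x1F)
        return START_CODE + bytes([nal_header]) + payload[2:]
    # Middle or last fragment: strip the two FU bytes and copy the rest.
    return payload[2:]

print(fu_a_to_annexb(bytes([0x7C, 0x85, 0xB8, 0x00])).hex())  # 0000000165b800
print(fu_a_to_annexb(bytes([0x7C, 0x05, 0x15, 0xAC])).hex())  # 15ac
```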

If you pay close attention, the header contains two separate bytes indicating order, and they always advance in step, so I'm sure no packet loss is occurring.

I also observed non-key frames arriving as 0x7c81 and being converted to a 00 00 00 01 61 NALU, but I want to focus solely on a single key frame for now, mostly because a movie recorded with the original software always begins with a 00 00 00 01 65 key frame (which obviously makes sense).

To get a working mp4, I decided to copy and paste the mp4 header of a genuine iVMS-4200 recording (here, that means every byte preceding the first frame NALU 00 00 00 01 65 in the mp4 file). I know that the resolution will match the actual camera footage. My strategy is to wait for a keyframe, replace 0x7c85 with the 00 00 00 01 65 NALU header and append the remaining bytes, and for 0x7c05 packets only append the bytes. With that I do seem to get something that could eventually work. When I ffplay the hand-crafted mp4, I do get something (with a little stretch of imagination, it's the camera's fisheye image forming), but clearly there is a problem.
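The keyframe strategy described above can be sketched as a small loop. This is a minimal illustration, not the original implementation: `collect_first_keyframe` is a hypothetical name, each input is assumed to start at the 0x7c byte, and treating bit 0x40 of the second byte as "last fragment" is borrowed from RFC 6184 rather than observed in the question.

```python
KEYFRAME_PREFIX = b"\x00\x00\x00\x01\x65"

def collect_first_keyframe(payloads) -> bytes:
    """Assemble the first key frame from payloads that start at 0x7c."""
    out = bytearray()
    started = False
    for payload in payloads:
        if len(payload) < 2 or payload[0] != 0x7C:
            continue  # not one of the 0x7cXX packets
        second = payload[1]
        if not started:
            if second != 0x85:
                continue  # keep waiting for the key-frame start
            out += KEYFRAME_PREFIX
            started = True
        elif second & 0x80 or second & 0x1F != 5:
            break  # another NAL unit begins; the key frame is over
        out += payload[2:]  # copy everything after the two marker bytes
        if second & 0x40:
            break  # assumed end-of-fragment bit (RFC 6184)
    return bytes(out)

packets = [
    b"\x7c\x81\xaa",      # non-key frame, skipped while waiting
    b"\x7c\x85\xb8\x01",  # key-frame start
    b"\x7c\x05\x15\xac",  # continuation
    b"\x7c\x45\x5d",      # last fragment under the RFC 6184 assumption
    b"\x7c\x81\x99",      # next frame, never reached
]
frame = collect_first_keyframe(packets)
print(frame.hex())
```

The made-up `packets` list only exercises the control flow; real payloads would come from the stream after removing the leading header bytes.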

Around the 3rd or 4th 0x7c05 packet (the failing packet differs on every run), the h264 stream I build by copying bytes becomes incorrect. Just by eyeballing the bytes I don't see anything unusual.

The stream breaks around decimal offset 750 of the failing packet. I know it's around this place because I kept stripping bytes away to see whether the same amount of the frame still arrived before it broke. Moreover, I dumped those bytes from the original software using lldb, taking my own Python implementation out of the equation, and I ran into the very same problem with the original packets.

The mp4 header I use should work, since it does for original recordings even if I manipulate the number of frames and leave just the first keyframe. Correct me if I'm wrong, but the step of converting this to MPEG2-PS (which iVMS-4200 does and I don't) should be entirely optional and unrelated to my problem.

Update: I set up a recording first and only then dumped the original iVMS-4200 packets. I edited the recorded movie to contain only the keyframe of interest, and it works. I found differences, but I cannot yet explain why they are there. Somehow 00 00 01 E0 13 FA 88 00 02 FF FF is inserted in the genuine recording (in the 4th packet), but I have no idea how this byte string was generated or what its purpose is. After I fixed the first difference, the next one shows the same striking pattern. But what actually is 00 00 01 E0 13 FA 88 00 02 FF FF? And why is it inserted after 18 03 25 10 and 2F F4 80 FC?
The 00 00 01 E0 signature would suggest those are Packetized Elementary Stream (PES) headers.
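Read as an MPEG-2 PES header, the 11 inserted bytes decode cleanly, which supports that suspicion. The sketch below assumes the standard PES layout (start code prefix, stream id, 16-bit packet length, two flag bytes, header data length, then optional fields or stuffing); `parse_pes_header` is a hypothetical helper, not something taken from the Hikvision libraries.

```python
import struct

def parse_pes_header(data: bytes) -> dict:
    """Decode the fixed part of an MPEG-2 PES header (assumed layout)."""
    if len(data) < 9 or data[:3] != b"\x00\x00\x01":
        raise ValueError("missing PES start code prefix")
    stream_id = data[3]  # 0xE0 to 0xEF are video streams
    packet_length, _flags1, flags2, header_data_length = struct.unpack_from(
        ">HBBB", data, 4
    )
    header_size = 9 + header_data_length  # bytes to skip to reach the payload
    # packet_length counts everything after the length field itself;
    # a value of 0 means "unbounded", so no payload size can be derived.
    payload_size = packet_length - 3 - header_data_length if packet_length else None
    return {
        "stream_id": stream_id,
        "packet_length": packet_length,
        "has_pts": bool(flags2 & 0x80),
        "has_dts": bool(flags2 & 0x40),
        "header_size": header_size,
        "payload_size": payload_size,
    }

pes = parse_pes_header(bytes.fromhex("000001E013FA880002FFFF"))
print(pes)
```

For the bytes in question this gives a video stream id, no PTS or DTS, two stuffing bytes (FF FF), a header of 11 bytes, and a packet length of 0x13FA = 5114. If this reading is right, each PES packet spans 5120 bytes in total (6 + 5114), which would explain why the same byte string reappears at regular intervals in the genuine recording.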



Solution 1:[1]

Going for the mp4 container wasn't a good choice after all. It turns out the RTP payload essentially yields a raw h264 stream. To inspect its structure, I converted the genuine mp4 recording to .264 like this:

ffmpeg -i recording.mp4 -codec copy recording.264

It's essentially an SPS (00 00 00 01 67) and a PPS (00 00 00 01 68), followed by the frame-data NALUs that I got in the stream.

Raw h264 turned out to be a much simpler structure to aim at, and I no longer have to deal with those Packetized Elementary Stream (PES) headers. That yields a correct image. In my case I just took the SPS and PPS that the original recordings use from recording.264. They could surely be resolved dynamically somehow, but I didn't bother.
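One way to pull the parameter sets out of a file like recording.264 instead of copying them by hand is to scan for Annex B start codes and pick NAL types 7 (SPS) and 8 (PPS). This is a minimal sketch with hypothetical helper names; it assumes the whole stream fits in memory, and the sample bytes at the end are invented purely to exercise the code.

```python
START_CODE = b"\x00\x00\x00\x01"

def split_nal_units(stream: bytes) -> list:
    """Return the NAL units (without start codes) found in Annex B data."""
    units = []
    pos = stream.find(b"\x00\x00\x01")
    while pos != -1:
        begin = pos + 3
        nxt = stream.find(b"\x00\x00\x01", begin)
        end = len(stream) if nxt == -1 else nxt
        # A NAL unit never ends in 0x00, so trailing zeros belong to the
        # next (four-byte) start code or to padding.
        unit = stream[begin:end].rstrip(b"\x00")
        if unit:
            units.append(unit)
        pos = nxt
    return units

def find_parameter_sets(stream: bytes):
    """Return the first SPS and PPS in the stream, or None where missing."""
    sps = pps = None
    for unit in split_nal_units(stream):
        nal_type = unit[0] & 0x1F
        if nal_type == 7 and sps is None:
            sps = unit
        elif nal_type == 8 and pps is None:
            pps = unit
    return sps, pps

# Invented sample data: SPS, PPS and a key-frame NALU.
demo = (
    START_CODE + b"\x67\x42\x80\x1f"
    + START_CODE + b"\x68\xce\x3c\x80"
    + START_CODE + b"\x65\xb8\x01"
)
sps, pps = find_parameter_sets(demo)
prefix = START_CODE + sps + START_CODE + pps  # goes before the first key frame
print(prefix.hex())
```

The resulting `prefix` would be written once at the start of the output file, followed by the converted frame NALUs.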

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1