How to convert binary string to normal string in PHP
Description of the problem
I am trying to import email content into a database table. Sometimes I get an SQL error while inserting a message. I found that the insert fails when the message content is a binary string instead of a normal string. For example, I get this in the console if I print a message that is imported successfully (truncated):
However, I get this with a problematic import:
I found out that if I use the function `utf8_encode`, I can import the message into SQL successfully. The problem is that it "breaks" the accented characters in imports that previously succeeded:
What I have tried
- Detect whether the string was a binary string with `ctype_print`; it returned false for both the non-binary and the binary string. I would then have been able to call `utf8_encode` only if it was binary.
- Use `unpack`; it did not work.
- Detect the string encoding with `mb_detect_encoding`; it returns `UTF-8` for both.
- Use `iconv`; it failed with `iconv(): Detected an illegal character in input string`.
- Cast the content to a string using `(string)` / `settype($html, 'string')`.
Question
How can I transform the binary string into a normal string so that I can import it into my database without breaking accented characters in other imports?
Solution 1:[1]
This is pretty late, but for anyone else reading... Apparently the `b` prefix is meaningless in PHP; it's a bit of a red herring. See: https://stackoverflow.com/a/51537602/6111743
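One way to convince yourself of this is to compare a `b`-prefixed literal with a plain one. This is only a minimal sketch using core PHP; the sample text is made up for illustration.

```php
<?php
// The b prefix is accepted by the parser but does not change the value:
// both literals produce the same ordinary PHP string (a sequence of bytes).
$plain  = "caf\xC3\xA9";   // "café" encoded as UTF-8
$binary = b"caf\xC3\xA9";  // same bytes, written with the b prefix

var_dump($plain === $binary);   // bool(true)
var_dump(gettype($binary));     // string(6) "string"
var_dump(strlen($binary));      // int(5)
```

So a `b` in debug output says nothing about which encoding the bytes are in; that has to be worked out from the bytes themselves.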
What encodings did you pass to `iconv()`? This is the correct solution, but you have to give it the correct first argument, which depends on your input. In my example I use `"LATIN1"` because that turned out to be the correct way to interpret my input, but your use case may vary.
You can use `mb_check_encoding()` to check whether the string is valid UTF-8. It returns a boolean.
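As a rough illustration of that check, the sketch below runs it on the same word in two encodings. The sample strings are hypothetical, and the target encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
// Hypothetical sample data: the same word in two different encodings.
$utf8   = "caf\xC3\xA9"; // "café" as UTF-8 (é is the two bytes 0xC3 0xA9)
$latin1 = "caf\xE9";     // "café" as ISO-8859-1 (é is the single byte 0xE9)

var_dump(mb_check_encoding($utf8, 'UTF-8'));   // bool(true)
var_dump(mb_check_encoding($latin1, 'UTF-8')); // bool(false)
```

The second string fails because a lone 0xE9 byte is not a complete UTF-8 sequence, which is the kind of input that tends to trigger the SQL error.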
Assuming the question is really something like "how to convert an extended ASCII string to a valid UTF-8 string in PHP", here is how I did it in my application:
    if (!mb_check_encoding($string)) {
        $string = iconv("LATIN1", "UTF-8//TRANSLIT//IGNORE", $string);
    }
The "TRANSLIT" part tells it to attempt transliteration; that is optional for you. The "IGNORE" part prevents it from raising `Detected an illegal character in input string` when it finds such a character; instead, the character is ignored, meaning it is removed from the output. Your use case may not need either of these.
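The same idea can be wrapped in a small function so that strings which are already valid UTF-8 are never touched. This is a sketch, not a definitive implementation: the name `ensureUtf8` and the default source encoding `ISO-8859-1` are assumptions for illustration, and the `//TRANSLIT//IGNORE` suffixes are left out so that a failed conversion is visible.

```php
<?php
/**
 * Hypothetical helper: return $text as UTF-8, converting only when needed.
 * $sourceEncoding is an assumption about the input; adjust it to your data.
 */
function ensureUtf8(string $text, string $sourceEncoding = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave it untouched
    }

    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceEncoding");
    }

    return $converted;
}

$alreadyUtf8 = "caf\xC3\xA9"; // "café" as UTF-8
$latin1      = "caf\xE9";     // "café" as ISO-8859-1

var_dump(ensureUtf8($alreadyUtf8) === $alreadyUtf8); // bool(true)
var_dump(bin2hex(ensureUtf8($latin1)));              // string(10) "636166c3a9"
```

Because the valid string is returned unchanged, accented characters in imports that already worked are not re-encoded a second time.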
When you're debugging, I recommend using just "UTF-8" as the second argument so you can see what it's doing. It's useful to see whether it raises an error. In my case, I had given it the wrong first argument at first (I wrote "ASCII" instead of "LATIN1") and it raised the illegal character error on an accented character. That error went away once I passed it the correct encoding.
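That debugging step might look roughly like the following. The input string is hypothetical, `ISO-8859-1` is used as the standard name for Latin-1, and the `@` operator only hides the notice so the return value can be inspected.

```php
<?php
$latin1 = "caf\xE9"; // hypothetical input: "café" in ISO-8859-1

// Wrong guess: 0xE9 is not a valid ASCII byte, so iconv() reports an
// illegal character and returns false instead of a converted string.
$wrong = @iconv('ASCII', 'UTF-8', $latin1);
var_dump($wrong);

// Right guess: every byte is a valid ISO-8859-1 character.
$right = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump(bin2hex($right)); // string(10) "636166c3a9"
```

Without `//IGNORE`, a wrong source encoding shows up as a failed conversion rather than as silently dropped characters.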
By the way, `mb_detect_encoding()` was no help to me in figuring out that Latin-1 was what I needed. What helped was dumping the contents of `unpack("C*", $string)` to see exactly which bytes were in there. That is more debugging advice than a solution, but worth mentioning in case it helps.
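A byte dump along those lines could be sketched as follows. The input string is a made-up example; in practice `$string` would be the message that fails to import.

```php
<?php
$string = "caf\xE9"; // hypothetical input whose encoding is unknown

// unpack("C*", ...) returns every byte as an unsigned integer,
// in an array indexed from 1.
$bytes = unpack('C*', $string);
print_r($bytes);

// Any value above 127 is outside ASCII and worth a closer look.
foreach ($bytes as $position => $byte) {
    if ($byte > 0x7F) {
        printf("byte %d: 0x%02X\n", $position, $byte); // byte 4: 0xE9
    }
}
```

A single 0xE9 where an "é" is expected points to a single-byte encoding such as Latin-1, whereas UTF-8 would show the pair 0xC3 0xA9.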
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | cheryllium |