How to replace all HTML tags with an empty string in Golang
I'm trying to replace all HTML tags such as <div> </div>
... with an empty string ("") in Golang, using the regex pattern ^[^.\/]*$/g
to match all closing tags, e.g. </div>.
My solution:
package main

import (
    "fmt"
    "regexp"
)

const Template = `^[^.\/]*$/g`

func main() {
    r := regexp.MustCompile(Template)
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    res := r.ReplaceAllString(s, "")
    fmt.Println(res)
}
But it outputs the same source string. What's wrong? Please help. Thanks.
The expected result should be: "afsdf4534534!@@!!#345345afsdf4534534!@@!!#"
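For context on why the original pattern never matches: Go's regexp package (RE2) has no trailing /g modifier, so the /g is treated as two literal characters after the $ anchor, which makes the pattern impossible to match; and even without it, ^[^.\/]*$ only matches a whole string that contains no . or / characters. ReplaceAllString already replaces every non-overlapping match, so no "global" flag is needed. Below is a minimal sketch of a pattern that does match tags in simple input like this (with the caveats about regex and malformed HTML discussed in the solutions that follow):

package main

import (
    "fmt"
    "regexp"
)

// Matches <...> and </...>; assumes simple, well-formed tags.
var tagRe = regexp.MustCompile(`</?[^>]+>`)

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    fmt.Println(tagRe.ReplaceAllString(s, ""))
    // afsdf4534534!@@!!#345345afsdf4534534!@@!!#
}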
Solution 1:[1]
If you want to remove all HTML tags, use an HTML tag stripper.
Using a regex to match HTML tags is not a good idea.
package main

import (
    "fmt"

    "github.com/grokify/html-strip-tags-go"
)

func main() {
    text := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    stripped := strip.StripTags(text)
    fmt.Println(text)
    fmt.Println(stripped)
}
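With the OP's input, this prints the original string followed by the stripped version, which matches the expected result: afsdf4534534!@@!!#345345afsdf4534534!@@!!#. Note that the import path ends in html-strip-tags-go but the package is declared as strip, which is why the call is strip.StripTags; some codebases make this explicit with an import alias.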
Solution 2:[2]
For those who came here looking for a quick solution, there is a library that does this: bluemonday.
Package bluemonday provides a way of describing a whitelist of HTML elements and attributes as a policy, and for that policy to be applied to untrusted strings from users that may contain markup. All elements and attributes not on the whitelist will be stripped.
package main

import (
    "fmt"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    // Do this once for each unique policy, and use the policy for the life of the program.
    // Policy creation/editing is not safe to use in multiple goroutines.
    p := bluemonday.StripTagsPolicy()

    // The policy can then be used to sanitize lots of input, and it is
    // safe to use the policy in multiple goroutines.
    html := p.Sanitize(
        `<a onblur="alert(secret)" href="http://www.google.com">Google</a>`,
    )

    // Output:
    // Google
    fmt.Println(html)
}
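Applied to the OP's input, the same policy should strip the div tags too. A minimal usage sketch, reusing the p policy created above:

    // Reusing the policy from above; with the OP's input this should print:
    // afsdf4534534!@@!!#345345afsdf4534534!@@!!#
    fmt.Println(p.Sanitize("afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"))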
Solution 3:[3]
The Problem with RegEx
This is a very simple RegEx replace method that removes HTML tags from well-formatted HTML in a string.
strip_html_regex.go
package main

import "regexp"

const regex = `<.*?>`

// This method uses a regular expression to remove HTML tags.
func stripHtmlRegex(s string) string {
    r := regexp.MustCompile(regex)
    return r.ReplaceAllString(s, "")
}
Note: this does not work well with malformed HTML. Don't use this.
A better way
Since a string in Go can be treated as a slice of bytes, walking through the string and finding the portions that are not inside an HTML tag is easy. When we identify a valid portion of the string, we can simply take a slice of it and append it using a strings.Builder.
strip_html.go
package main

import (
    "strings"
    "unicode/utf8"
)

const (
    htmlTagStart = 60 // Unicode `<`
    htmlTagEnd   = 62 // Unicode `>`
)

// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHtmlTags(s string) string {
    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`.
    end := 0    // The index of the previous end tag character `>`.

    for i, c := range s {
        // If this is the last character and we are not in an HTML tag, save it.
        if (i+1) == len(s) && end >= start {
            builder.WriteString(s[end:])
        }

        // Keep going if the character is not `<` or `>`.
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>`, not just `<br>`.
            if !in {
                start = i
            }
            in = true

            // Write the valid string between the close and start of the two tags.
            builder.WriteString(s[end:start])
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }
    s = builder.String()
    return s
}
If we run these two functions with the OP's text and some malformed HTML, you will see that the results are not consistent.
main.go
package main

import "fmt"

func main() {
    s := "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"
    res := stripHtmlTags(s)
    fmt.Println(res)

    // Malformed HTML examples
    fmt.Println("\n:: stripHTMLTags ::\n")
    fmt.Println(stripHtmlTags("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlTags("h1>I broke this</h1>"))
    fmt.Println(stripHtmlTags("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlTags("I don't know ><where to <<em>start</em> this tag<."))

    // Regex malformed HTML examples
    fmt.Println(":: stripHtmlRegex ::\n")
    fmt.Println(stripHtmlRegex("Do something <strong>bold</strong>."))
    fmt.Println(stripHtmlRegex("h1>I broke this</h1>"))
    fmt.Println(stripHtmlRegex("This is <a href='#'>>broken link</a>."))
    fmt.Println(stripHtmlRegex("I don't know ><where to <<em>start</em> this tag<."))
}
Output:
afsdf4534534!@@!!#345345afsdf4534534!@@!!#
:: stripHTMLTags ::
Do something bold.
I broke this
This is broken link.
start this tag
:: stripHtmlRegex ::
Do something bold.
h1>I broke this
This is >broken link.
I don't know >start this tag<.
Note that the RegEx method does not remove all HTML tags consistently. To be honest, I am not good enough at RegEx to write a match string that properly handles stripping HTML.
Benchmarks
Aside from being safer and more aggressive in stripping malformed HTML tags, stripHtmlTags is about 4 times faster than stripHtmlRegex.
> go test -run=Calculate -bench=.
goos: windows
goarch: amd64
BenchmarkStripHtmlRegex-8 51516 22726 ns/op
BenchmarkStripHtmlTags-8 230678 5135 ns/op
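The benchmark functions themselves are not shown in the answer; a minimal sketch of what they could look like follows (the file name and the benchSample input are assumptions for illustration, not from the original):

// strip_html_test.go (hypothetical file name)
package main

import "testing"

// benchSample is an assumed input; the original benchmark input is not shown.
var benchSample = "afsdf4534534!@@!!#<div>345345afsdf4534534!@@!!#</div>"

func BenchmarkStripHtmlRegex(b *testing.B) {
    for i := 0; i < b.N; i++ {
        stripHtmlRegex(benchSample)
    }
}

func BenchmarkStripHtmlTags(b *testing.B) {
    for i := 0; i < b.N; i++ {
        stripHtmlTags(benchSample)
    }
}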
Solution 4:[4]
Starting from @Daniel Morell's function, I have created another function with a few more possibilities. I am sharing it here in case it is useful for someone:
// CreateCleanWords takes a string and returns a string array with all words in the string.
// Rules:
//   - words of length >= minAcceptedLength
//   - everything between < and > is discarded
//   - admitted characters: numbers, letters, and all characters in the validRunes map
//   - words not present in the wordBlackList map
//   - word separators are space or single quote (could be improved with a map of separators)
func CreateCleanWords(s string) []string {
    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(s) + utf8.UTFMax)

    insideTag := false // True if we are inside an HTML tag.
    var c rune
    var managed bool = false
    var valid bool = false
    var finalWords []string
    var singleQuote rune = '\''
    var minAcceptedLength = 4
    var wordBlackList map[string]bool = map[string]bool{
        "sull":  false,
        "sullo": false,
        "sulla": false,
        "sugli": false,
        "sulle": false,
        "alla":  false,
        "all":   false,
        "allo":  false,
        "agli":  false,
        "alle":  false,
        "dell":  false,
        "della": false,
        "dello": false,
        "degli": false,
        "delle": false,
        "dall":  false,
        "dalla": false,
        "dallo": false,
        "dalle": false,
        "dagli": false,
    }
    var validRunes map[rune]bool = map[rune]bool{
        'à': true,
        'è': true,
        'é': true,
        'ì': true,
        'ò': true,
        'ù': true,
        '€': true,
        '$': true,
        '£': true,
        '-': true,
    }

    for _, c = range s {
        managed = false
        valid = false
        //show := string(c)
        //fmt.Println(show)

        // Found `<`: from here on, ignore characters.
        if !managed && c == htmlTagStart {
            insideTag = true
            managed = true
            valid = false
        }
        // Found `>`: characters are valid again.
        if !managed && c == htmlTagEnd {
            insideTag = false
            managed = true
            valid = false
        }
        // If we are inside an HTML tag, we don't check anything because we won't take anything
        // until we reach the tag end.
        if !insideTag {
            if !managed && unicode.IsSpace(c) || c == singleQuote {
                // Found a separator: if there is a valid word, add it to the word array,
                // but only if it reaches the minimum length.
                if builder.Len() >= minAcceptedLength {
                    word := strings.ToLower(builder.String())
                    // First check that the word is not in the black list.
                    if _, ok := wordBlackList[word]; !ok {
                        // The word is not in the blacklist: add it to finalWords.
                        finalWords = append(finalWords, word)
                    }
                }
                // Make the builder ready for the next token.
                builder.Reset()
                valid = false
                managed = true
            }
            // Letters and digits are welcome.
            if !managed {
                valid = unicode.IsLetter(c) || unicode.IsDigit(c)
                managed = valid
            }
            // Other Italian runes are accepted.
            if !managed {
                _, valid = validRunes[c]
            }
            if valid {
                builder.WriteRune(c)
            }
        }
    }
    // Remember to check the last word after exiting the loop!
    if builder.Len() >= minAcceptedLength {
        // First check that the word is not in the black list.
        word := strings.ToLower(builder.String())
        if _, ok := wordBlackList[word]; !ok {
            // The word is not in the blacklist: add it to finalWords.
            finalWords = append(finalWords, word)
        }
        builder.Reset()
    }
    return finalWords
}
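A short usage sketch (the input string is made up for illustration; it assumes the htmlTagStart/htmlTagEnd constants from Solution 3 and the fmt, strings, unicode, and unicode/utf8 imports are in scope):

func main() {
    words := CreateCleanWords("Questo è <b>un esempio</b> della funzione, con l'apostrofo e numeri 12345.")
    fmt.Println(words)
    // Expected (roughly): [questo esempio funzione apostrofo numeri 12345]
    // Short words, the blacklisted "della", and everything inside <b>...</b> tags are dropped.
}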
Solution 5:[5]
An improvement on @Daniel Morell's answer. The only difference here is that len on a string counts bytes, and a UTF-8 character can take between 1 and 4 bytes, so len("è") actually evaluates to 2. To fix that, we convert the string to a slice of rune.
https://go.dev/play/p/xo7Mrx5qw-_J
// Aggressively strips HTML tags from a string.
// It will only keep anything between `>` and `<`.
func stripHTMLTags(s string) string {
    // Supports UTF-8, since some characters take more than one byte, e.g. len("è") == 2.
    d := []rune(s)

    // Set up a string builder and allocate enough memory for the new string.
    var builder strings.Builder
    builder.Grow(len(d) + utf8.UTFMax)

    in := false // True if we are inside an HTML tag.
    start := 0  // The index of the previous start tag character `<`.
    end := 0    // The index of the previous end tag character `>`.

    for i, c := range d {
        // If this is the last character and we are not in an HTML tag, save it.
        // Note: i, start and end are rune indices, so slice d rather than s.
        if (i+1) == len(d) && end >= start {
            builder.WriteString(string(d[end:]))
        }

        // Keep going if the character is not `<` or `>`.
        if c != htmlTagStart && c != htmlTagEnd {
            continue
        }

        if c == htmlTagStart {
            // Only update the start if we are not in a tag.
            // This makes sure we strip out `<<br>`, not just `<br>`.
            if !in {
                start = i
            }
            in = true

            // Write the valid string between the close and start of the two tags.
            builder.WriteString(string(d[end:start]))
            continue
        }
        // else c == htmlTagEnd
        in = false
        end = i + 1
    }
    s = builder.String()
    return s
}
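A quick usage sketch with a multi-byte character before a tag, which is the case the rune conversion is meant to handle (the input string is made up for illustration; it assumes the constants and imports from Solution 3 plus fmt are in scope):

func main() {
    // "è" is 2 bytes but 1 rune; with rune indices the slicing stays aligned
    // and the tags are stripped cleanly.
    fmt.Println(stripHTMLTags("caffè <b>forte</b>!"))
    // Expected: caffè forte!
}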
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 |
Solution 2 |
Solution 3 |
Solution 4 | Alfonso Moscato
Solution 5 | leogoesger