Remove HTML tags, including &nbsp;, from a string in C#

optionbox 2020. 9. 23. 07:31

How can I use a regex in C# to remove all HTML tags, including &nbsp;? My string looks like this:

  "<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

If you can't use an HTML parser-based solution to filter out the tags, here is a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with a regex that collapses runs of whitespace.

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

I took @Ravi Thapliyal's code and turned it into a method. It is simple and may not clean up everything, but so far it does what I need.

public static string ScrubHtml(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
    var step2 = Regex.Replace(step1, @"\s{2,}", " ");
    return step2;
}
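
One thing to watch with the method above: because each tag is replaced with an empty string, text from adjacent elements runs together (`<p>one</p><p>two</p>` becomes `onetwo`). Below is a minimal sketch of a variant that substitutes a space instead. The names HtmlScrubSketch and ScrubHtmlSpaced are made up for illustration and are not part of the original answer.

```csharp
using System.Text.RegularExpressions;

public static class HtmlScrubSketch
{
    // Hypothetical variant of ScrubHtml: tags and &nbsp; become a space
    // so that text from neighbouring elements stays separated.
    public static string ScrubHtmlSpaced(string value)
    {
        var spaced = Regex.Replace(value, @"<[^>]+>|&nbsp;", " ");
        // Collapse every whitespace run to one space, then trim the ends.
        return Regex.Replace(spaced, @"\s+", " ").Trim();
    }
}
```

With this variant, `HtmlScrubSketch.ScrubHtmlSpaced("<p>one</p><p>two</p>")` gives `one two`. Whether a space is the right separator depends on the markup: inline tags such as `<b>` in the middle of a word would also introduce one.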

I have been using this function for a while. It removes pretty much any messy HTML you can throw at it and leaves the text intact.

        // Needs: using System; using System.Text; using System.Text.RegularExpressions; using System.Web;
        private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

        // Add characters that should not be removed to this regex.
        private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

        public static String UnHtml(String html)
        {
            html = HttpUtility.UrlDecode(html);
            html = HttpUtility.HtmlDecode(html);

            html = RemoveTag(html, "<!--", "-->");
            html = RemoveTag(html, "<script", "</script>");
            html = RemoveTag(html, "<style", "</style>");

            //replace matches of these regexes with space
            html = _tags_.Replace(html, " ");
            html = _notOkCharacter_.Replace(html, " ");
            html = SingleSpacedTrim(html);

            return html;
        }

        private static String RemoveTag(String html, String startTag, String endTag)
        {
            Boolean bAgain;
            do
            {
                bAgain = false;
                Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
                if (startTagPos < 0)
                    continue;
                Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
                if (endTagPos <= startTagPos)
                    continue;
                html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
                bAgain = true;
            } while (bAgain);
            return html;
        }

        private static String SingleSpacedTrim(String inString)
        {
            StringBuilder sb = new StringBuilder();
            Boolean inBlanks = false;
            foreach (Char c in inString)
            {
                switch (c)
                {
                    case '\r':
                    case '\n':
                    case '\t':
                    case ' ':
                        if (!inBlanks)
                        {
                            inBlanks = true;
                            sb.Append(' ');
                        }   
                        continue;
                    default:
                        inBlanks = false;
                        sb.Append(c);
                        break;
                }
            }
            return sb.ToString().Trim();
        }

var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
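
Listing entities one by one in the pattern only covers the ones you thought of. As a sketch of an alternative, assuming it is acceptable to decode entities into their characters rather than delete them, the tags can be stripped first and the remainder passed through System.Net.WebUtility.HtmlDecode. The class and method names here are illustrative only.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class PlainTextSketch
{
    public static string ToPlainText(string html)
    {
        // 1. Replace each tag with a space so neighbouring text stays apart.
        string withoutTags = Regex.Replace(html, @"<[^>]+>", " ");
        // 2. Decode entities; &nbsp; becomes U+00A0 (no-break space).
        string decoded = WebUtility.HtmlDecode(withoutTags);
        // 3. In .NET, \s also matches U+00A0, so this collapses it as well.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

Decoding happens after the tags are removed, so escaped markup in the text (for example `&lt;b&gt;`) ends up as literal characters in the output instead of being stripped as a tag.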

I used @RaviThapliyal's and @Don Rolling's code with a small modification. Their code replaces &nbsp with an empty string, but &nbsp should be replaced with a space, so I added an extra step. It worked like a charm for me.

public static string FormatString(string value) {
    var step1 = Regex.Replace(value, @"<[^>]+>", "");
    var step2 = Regex.Replace(step1, @"&nbsp;", " ");
    var step3 = Regex.Replace(step2, @"\s{2,}", " ");
    // Trim last, so spaces produced from leading or trailing &nbsp; are removed too.
    return step3.Trim();
}

I wrote &nbsp without the semicolon in the text above because Stack Overflow was formatting it.

This:

(<.+?>|&nbsp;)

will match any tag or &nbsp;

string regex = @"(<.+?>|&nbsp;)";
var x = Regex.Replace(originalString, regex, "").Trim();

Then x = hello
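
A caveat with `.+?`: by default `.` does not match a newline in .NET, so a tag whose attributes wrap onto a second line is not matched unless RegexOptions.Singleline is set, whereas the `[^>]+` form used in the other answers does cross line breaks. Here is a small sketch comparing the two; the class name and sample input are invented for illustration.

```csharp
using System.Text.RegularExpressions;

public static class TagPatternComparison
{
    // An opening tag that spans two lines.
    public const string Sample = "<a\nhref='x'>link</a>";

    // '.' stops at the newline, so the opening tag is left in place.
    public static string WithDot(string s) =>
        Regex.Replace(s, @"<.+?>", "");

    // Singleline lets '.' match '\n' as well.
    public static string WithDotSingleline(string s) =>
        Regex.Replace(s, @"<.+?>", "", RegexOptions.Singleline);

    // A negated character class matches newlines without any option.
    public static string WithNegatedClass(string s) =>
        Regex.Replace(s, @"<[^>]+>", "");
}
```

For `Sample`, `WithDot` removes only the closing `</a>`, while the other two return `link`.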

Sanitizing an HTML document involves a lot of tricky things. This package may be of help: https://github.com/mganss/HtmlSanitizer

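Well-formed HTML (XHTML) is essentially XML. You could parse your text into an XmlDocument object and read InnerText on the root element to extract the text. This strips all tags in any form and also decodes XML entities such as &lt; in one go. It only works for well-formed markup, though: the string in the question, with its unclosed <br> tags and its &nbsp; entities (which XML does not predefine), would make the parser throw unless it is cleaned up first.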
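
Below is a minimal sketch of the XmlDocument approach, assuming the input is well-formed XML (every tag closed, only the predefined XML entities). The wrapper element name `root` and the class name are arbitrary choices for this example. The wrapper is there because a fragment with several top-level elements is not a valid XML document on its own.

```csharp
using System.Xml;

public static class XmlTextSketch
{
    public static string InnerTextOf(string xhtmlFragment)
    {
        var doc = new XmlDocument();
        // LoadXml throws XmlException on malformed input, such as an
        // unclosed <br> or an undeclared entity like &nbsp;.
        doc.LoadXml("<root>" + xhtmlFragment + "</root>");
        return doc.DocumentElement.InnerText;
    }
}
```

For example, `XmlTextSketch.InnerTextOf("<div>a &lt; b</div><div><br/></div>")` returns `a < b`, while passing `<div><br></div>` raises an XmlException.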

(<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1

Reference URL: https://stackoverflow.com/questions/19523913/remove-html-tags-from-string-including-nbsp-in-c-sharp
