|
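When you fetch web pages from multiple threads with the HttpWebRequest and HttpWebResponse classes, what comes back is the page's HTML source. That leaves the job of separating the text from the HTML source.
The function below is one algorithm for doing this. It works reasonably well in practice. Does anyone know of a better algorithm?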
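Private Function GetHtmlText(ByVal HTML As String) As String
    'HTML = StrConv(HTML, VbStrConv.SimplifiedChinese)
    ' Remove space characters before scanning.
    HTML = HTML.Replace(" ", "")
    Dim temp As String = String.Empty
    Dim HtmlText As String = String.Empty
    Dim i As Integer = 0
    Dim j As Integer = 0
    ' Start at the <body> tag; fall back to the beginning if there is none.
    Dim k As Integer = HTML.IndexOf("<body", StringComparison.OrdinalIgnoreCase)
    If k < 0 Then k = 0
    Do
        ' i: position of the ">" that closes the current tag.
        i = HTML.IndexOf(">", k)
        If i >= 0 Then
            ' j: position of the "<" that opens the next tag.
            j = HTML.IndexOf("<", i + 1)
            If j >= 0 Then
                ' Text between the end of this tag and the start of the next.
                temp = HTML.Substring(i + 1, j - i - 1)
                If temp.Length > 0 Then
                    HtmlText = HtmlText & temp
                End If
            Else
                Exit Do
            End If
        Else
            Exit Do
        End If
        ' Continue scanning from the next tag.
        k = j
    Loop
    Return HtmlText
End Function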
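One possible answer to the question above, offered only as a rough sketch: because the tag-scanning loop keeps everything between tags, the contents of <script> and <style> elements end up in the result as if they were text. The variant below removes those elements first using regular expressions, then strips the remaining tags and decodes entities. The names StripTags and HtmlTextSketch are hypothetical and not from the original post. Regex-based stripping is a heuristic and can be fooled by malformed markup.

```vb
Imports System
Imports System.Net
Imports System.Text.RegularExpressions

Module HtmlTextSketch
    ' Hypothetical helper (not from the original post): a regex-based
    ' alternative to scanning for "<" and ">" by hand.
    Function StripTags(ByVal html As String) As String
        ' Drop <script> and <style> elements together with their contents.
        Dim noCode As String = Regex.Replace(html,
            "<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)
        ' Drop HTML comments.
        Dim noComments As String = Regex.Replace(noCode,
            "<!--.*?-->", "", RegexOptions.Singleline)
        ' Replace each remaining tag with a space so adjacent words stay apart.
        Dim noTags As String = Regex.Replace(noComments, "<[^>]+>", " ")
        ' Decode entities such as &amp;, then collapse runs of whitespace.
        Dim decoded As String = WebUtility.HtmlDecode(noTags)
        Return Regex.Replace(decoded, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim page As String =
            "<html><head><style>p{color:red}</style></head><body>" &
            "<p>Hello</p><script>var x = 1;</script>" &
            "<p>World &amp; more</p></body></html>"
        Console.WriteLine(StripTags(page))   ' prints: Hello World & more
    End Sub
End Module
```

Replacing tags with a space instead of an empty string is a deliberate choice here: it keeps words from neighbouring elements from running together. For Chinese text, where that separation matters less, an empty string may be preferable.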
|