The IDE is Visual Studio 2013. I searched for Jumony in NuGet and added it to the project.

Using the Jumony library

1. Fetch the html from the website and parse the html string into a standard Document Object Model (DOM).

IHtmlDocument source = new JumonyParser().LoadDocument("", System.Text.Encoding.GetEncoding("utf-8"));

Jumony's API can load and parse a document directly from the Internet and automatically detects the encoding from the HTTP header. For the website above, however, this did not work, while other websites (such as Blog Garden (cnblogs) and Starting Point (Qidian)) were fine. So I downloaded the Jumony source code and stepped through it. It turned out that the html was being fetched, but it came back garbled, so the Jumony class library was parsing garbled text and the result was wrong. The solution is to specify utf-8 explicitly.
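To illustrate why a wrong encoding garbles the page, here is a small sketch that uses only System.Text and is independent of Jumony. The sample string and the iso-8859-1 encoding are assumptions chosen for illustration; which encoding was actually picked for this site is not known.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decode raw bytes using the encoding with the given name.
    public static string Decode(byte[] raw, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(raw);
    }

    public static void Main()
    {
        const string html = "<title>楔子</title>";

        // The bytes a server would send for a utf-8 page.
        byte[] raw = Encoding.UTF8.GetBytes(html);

        // Assumed mismatch: the utf-8 bytes are read as iso-8859-1.
        string wrong = Decode(raw, "iso-8859-1");
        // Matching encoding: the text survives intact.
        string right = Decode(raw, "utf-8");

        Console.WriteLine(wrong == html); // False
        Console.WriteLine(right == html); // True
    }
}
```

Passing the encoding explicitly to LoadDocument, as in step 1, plays the same role as the "utf-8" argument here.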

2. Get all the meta tags

var aLinks = source.Find("meta"); // Get all meta tags
foreach (var aLink in aLinks)
{
    if (aLink.Attribute("name").Value() == "keywords")
    {
        // The keywords string, e.g. the novel title followed by keywords such as "latest chapters" and "full text reading"
        name = aLink.Attribute("content").Value();
    }
}

3. Get the meta tag with name=keywords and read the value of its content attribute.

string name = source.Find("meta[name=keywords]").FirstOrDefault().Attribute("content").Value();

4. Get all tags with class=L

var lLinks = source.Find(".L"); // Get all td tags with class=L
foreach (var lLink in lLinks) // loop over the class=L td tags
{
    // lLink value, for example: <td class="L"><a href="">楔子</a></td>
}
var aLinks = source.Find(".L a"); // Get all a tags under class=L
foreach (var aLink in aLinks)
{
    // aLink value: <a href="">楔子</a>
    string title = aLink.InnerText(); // 楔子
    string url = aLink.Attribute("href").Value(); //
}

5. Get tags by ID

var chapterLink = source.Find("#at a"); // Find all a tags under id=at
foreach (var i in chapterLink) // each item here is an a tag
{
    // i value, for example: <a href="">Wedge</a>
    string title = i.InnerText(); // Wedge
    string url = i.Attribute("href").Value(); //
}

Complete C# code

public class CrawlerController : BaseController
{
    // GET: Crawler
    public void Index()
    {
        // The utf-8 encoding must be given, otherwise the html is garbled.
        IHtmlDocument source = new JumonyParser().LoadDocument("",
            System.Text.Encoding.GetEncoding("utf-8"));
        // <meta name="keywords" content="aaa,bbb,ccc" />
        // The first keyword is used as the name.
        string name = source.Find("meta[name=keywords]").FirstOrDefault().Attribute("content").Value().Split(',')[0];
        // Get the article (chapter) links: all a tags under id=at
        var chapterLink = source.Find("#at a");
        foreach (var i in chapterLink) // each item here is an a tag
        {
            // Chapter title
            string title = i.InnerText();
            string url = i.Attribute("href").Value();
            // Fetch the html page of the article from its url
            IHtmlDocument sourceChild = new JumonyParser().LoadDocument(url, System.Text.Encoding.GetEncoding("utf-8"));
            // Find the body content of the article under id=contents
            string content = sourceChild.Find("#contents").FirstOrDefault().InnerHtml().Replace(" ", "").Replace("<br />", "\r\n");
            // txt output (the AddArticle helper is not shown here)
            string path = AppDomain.CurrentDomain.BaseDirectory.Replace("\\", "/") + "Txt/";
            AddArticle(title + "\r\n" + content, name, path);
        }
    }
}
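The complete code calls AddArticle, but its definition is not included in this post. Below is a minimal sketch of what such a helper might look like, assuming it appends each chapter to a single txt file named after the novel inside the given folder. The class name ArticleWriter, the ".txt" file naming and the returned file path are all assumptions for illustration, not the original implementation.

```csharp
using System;
using System.IO;
using System.Text;

public static class ArticleWriter
{
    // Hypothetical stand-in for the AddArticle helper used above.
    // Appends the chapter text to <path>/<name>.txt and returns the file path.
    public static string AddArticle(string content, string name, string path)
    {
        // Create the output folder if it does not exist yet (no-op otherwise).
        Directory.CreateDirectory(path);

        string file = Path.Combine(path, name + ".txt");

        // Append so that successive chapters end up in the same file.
        File.AppendAllText(file, content + "\r\n", Encoding.UTF8);
        return file;
    }
}
```

Appending rather than overwriting matches the loop in the complete code, where AddArticle is called once per chapter with the same name and path. A real version would probably also need to strip characters that are invalid in file names from name.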

Jumony source code address: