1, XPath parsing

XPath is a language for locating information in XML and HTML documents. The lxml library supports XPath queries for extracting data from parsed HTML.

Common XPath rules:

nodename : selects all child nodes of the named node
// : selects descendant nodes from the current node
/ : selects direct child nodes from the current node
. : selects the current node
.. : selects the parent of the current node
@ : selects attributes
1. Initialize HTML

etree.parse() parses an HTML file and constructs an XPath-ready parse tree;

etree.tostring() repairs the HTML, filling in missing head or tail tags;

result.decode('utf-8') converts the repaired HTML from bytes to a string.

from lxml import etree

html=etree.parse('c:/.../test.html',etree.HTMLParser())
result=etree.tostring(html)
result.decode('utf-8')
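As a sketch of the repair behavior described above, assuming an in-memory fragment instead of a file on disk: etree.HTML() builds a tree from broken markup, and tostring() serializes the repaired document.

```python
from lxml import etree

# A fragment with missing closing tags; the HTML parser repairs it.
broken = '<ul><li>first<li>second'
html = etree.HTML(broken)  # parses and wraps the fragment in <html><body>

# tostring() serializes the repaired tree as bytes; decode() converts to str.
result = etree.tostring(html).decode('utf-8')
print(result)  # the missing </li> and </ul> tags are filled in
```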
2. Get all nodes

XPath rules generally start with //

example:
html.xpath('//*')  # get all nodes
html.xpath('//li')  # get all li nodes
3. Child nodes, descendant nodes
html.xpath('//li/a')  # all direct a children of every li
html.xpath('//ul//a')  # all descendant a nodes under every ul
4. Parent node
html.xpath('//a[@href="links.html"]/../@class')
# find the class value of the parent of each a node whose href is links.html
# .. performs the parent lookup
5. Attribute matching
html.xpath('//li[@class="item-0"]')  # select nodes whose class value is item-0

6. Text acquisition
html.xpath('//li[@class="item-0"]/a/text()')
or html.xpath('//li[@class="item-0"]//text()')
7. Attribute acquisition
html.xpath('//li/a/@href')  # get the href attribute values of a nodes under li
8. Attribute multi-value matching
html.xpath('//li[contains(@class,"li")]/a/text()')
# matches as long as the node's class attribute contains "li"
9. Multi-attribute matching
html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
# matches li nodes whose class contains "li" and whose name attribute is "item"
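The rules above can be exercised end to end. A minimal sketch, using a hypothetical fragment (the class, name, and href values below are made up for illustration):

```python
from lxml import etree

# Hypothetical markup exercising the XPath rules above.
html = etree.HTML('''
<ul class="list">
  <li class="item-0"><a href="links.html">first</a></li>
  <li class="item-1 li" name="item"><a href="second.html">second</a></li>
</ul>
''')

print(html.xpath('//li[@class="item-0"]/a/text()'))     # ['first']
print(html.xpath('//a[@href="links.html"]/../@class'))  # ['item-0'] (parent via ..)
print(html.xpath('//li/a/@href'))                       # ['links.html', 'second.html']
print(html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()'))  # ['second']
```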

2, Beautiful Soup parsing

Beautiful Soup is an HTML/XML parsing library.
It provides the data a crawler needs by parsing the document.

A parser is required: the lxml HTML parser, the lxml XML parser, the Python standard library (html.parser), or html5lib.

Basic usage:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
1. soup.prettify()

Calling prettify() outputs the parsed document as a standard, indented string.

2. Node selector:

Example: soup.title.string

(1) Selecting elements:
  soup.title, soup.title.string, soup.head, soup.p
(2) Extracting information:
  1) Get the node name: soup.title.name
  2) Get attributes: soup.p.attrs, soup.p.attrs['name']
  3) Get the content: soup.p.string
(3) Nested selection:
soup.head.title.string
(4) Associated selection:
  (enumerate() can wrap the generators below to get indices)
(1) soup.p.contents  # list of the p node's direct children
(2) soup.p.descendants  # all descendant nodes of the p node (a generator)
(3) Parent and ancestor nodes: soup.p.parent, soup.p.parents
(4) Sibling nodes:
    soup.a.next_sibling
    soup.a.previous_sibling
    soup.a.next_siblings
    enumerate(soup.a.previous_siblings)
(5) Extracting information: soup.a.next_sibling.string
3. Method selectors:

find_all(name, attrs, recursive, text, **kwargs)

(1) name: find_all(name='li')
(2) attrs: find_all(attrs={'id': 'list-1'}), find_all(class_='element')
(3) text: matches node text; find_all(text='string or regular expression')

Note: find() is used the same way as find_all(), except that it returns only the first match.
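A short sketch of the node and method selectors above, using a hypothetical document (html.parser is used here so no extra parser is needed; 'lxml' works the same if installed):

```python
from bs4 import BeautifulSoup

# Hypothetical document for illustration.
html = '''
<html><head><title>Demo</title></head>
<body>
<ul id="list-1">
  <li class="element">foo</li>
  <li class="element">bar</li>
</ul>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)                               # Demo
print(soup.title.name)                                 # title
print([li.string for li in soup.find_all(name='li')])  # ['foo', 'bar']
print(len(soup.find_all(attrs={'id': 'list-1'})))      # 1
print(soup.find(class_='element').string)              # foo  (find() returns the first match)
```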

4. CSS selectors

Call the select() method, passing in the corresponding CSS selector:

soup.select('.panel .panel-heading')
soup.select('ul li')  # all li under ul
soup.select('#list-2 .element')
(1) Nested selection:
for ul in soup.select('ul'):
    ul.select('li')
(2) Getting attributes:
for ul in soup.select('ul'):
    ul.attrs['id']
    ul['id']
(3) Getting text:
for li in soup.select('li'):
    li.get_text()
    li.string
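The CSS selector patterns above, sketched with a hypothetical fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for the CSS selector patterns above.
html = '''
<ul id="list-2">
  <li class="element">one</li>
  <li class="element">two</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

print([li.get_text() for li in soup.select('ul li')])  # ['one', 'two']
print(soup.select('#list-2 .element')[0].string)       # one
for ul in soup.select('ul'):
    print(ul.attrs['id'], ul['id'])                    # list-2 list-2
```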

3, pyquery parsing

1. Initialization:

(1) String initialization:

html=''' *******
     '''
from pyquery import PyQuery as pq
doc=pq(html)
print(doc('li'))

(2) URL initialization

doc=pq(url='https:/ …')

(3) File initialization

doc=pq(filename='demo.html')

print(doc('li'))

2. Basic CSS selectors
doc('#container .list li')
# all li under class list, inside the element whose id is container
3. Finding nodes

(1) Descendant nodes, child nodes
.find(): finds all descendant nodes
items=doc('.list')
items.find('li')

.children(): finds direct child nodes
items=doc('.list')
items.children('.active')

(2) Parent node
doc=pq(html)
items=doc('.list')
items.parent()
Ancestor nodes:
items.parents()

(3) Sibling nodes
doc=pq(html)
li=doc('.list .item-0.active')
li.siblings('.active')

4. Traversal

Use the items() method to get a generator for traversal:

doc=pq(html)
lis=doc('li').items()
for li in lis:
  print(li)
5. Getting information

(1) Getting attributes
a=doc('.item-0.active a')
print(a.attr('href')) or print(a.attr.href)
Note: attr() returns the attribute of the first matched a node only; traverse with items() to get all of them.

(2) Getting text
.text()
a=doc('.item-0.active a')
a.text()  # text() outputs the text content of all matched nodes

.html()
li=doc('li')
li.html()  # html() outputs the inner HTML of the first li node only

6. Node operations

(1) removeClass, addClass
li=doc('.item-0.active')
print(li)
li.removeClass('active')  # remove the active class
li.addClass('active')     # add the active class

(2) attr, text, html
li.attr('name','link')  # add the attribute name="link"
li.text('changed item')  # change the text to "changed item"
li.html('<span>changed item</span>')  # change the inner HTML

(3) remove()
wrap=doc('.wrap')
wrap.find('p').remove()  # delete the p nodes inside wrap
wrap.text()


Note: pseudo-class selectors

Pseudo-class selectors can select the first node, the last node, even- or odd-numbered nodes, and nodes containing a given text (e.g. :first-child, :last-child, :even/:odd, :contains('text')).