What is a web crawler?

A web crawler is a program or script that automatically fetches information from the network according to certain rules. The World Wide Web is like a giant spider web, and a web crawler is a spider on that web, continuously crawling the information we need.

What is Python?

Python is a dynamic, object-oriented scripting language. It was originally designed for writing automation scripts (shell scripts), but as new versions and language features have been released it is increasingly used for stand-alone, large-scale projects. Development in Python, especially the development of web crawlers, is simple, fast and effective.

Three elements of a crawler

  1. Grabbing (fetching pages)
  2. Analysis (parsing the content)
  3. Storage (saving the data)

Basic crawler grab operation

1. urllib

In Python 2.x we could use either urllib or urllib2 for web crawling, but Python 3.x removed urllib2; only urllib is available.

import urllib.request

# fetch the page and print its HTML content
response = urllib.request.urlopen('http://www.codecoder.top/tag/crawler')
print(response.read().decode('utf-8'))

Urllib with parameters

from urllib.parse import urlencode
url = 'https://blog.csdn.net/weixin_43499626'
# append query parameters, e.g. ?key1=value1&key2=value2
url = url + '?' + urlencode({'key1': 'value1', 'key2': 'value2'})

2. requests

The requests library is a very useful HTTP client library and the one most commonly used for crawl operations. It covers most needs.

import requests

# simple GET request
response = requests.get(url='http://www.codecoder.top/tag/crawler')
print(response.text)   # print the response body
# GET request with query parameters
response = requests.get(url='http://www.codecoder.top/tag/crawler', params={'key1': 'value1', 'key2': 'value2'})

How does a web crawler log in to a website?

1. Logging in by submitting a form

Submitting a form sends a POST request with the relevant parameters to the server, and the cookie returned by the server is saved locally. The cookie is the server's "monitor" on the client and records the login state; the server determines whether a client is logged in by checking the cookie carried in each request.

import requests

# submit the username and password to the login endpoint
params = {'username': 'root', 'passwd': 'root'}
response = requests.post("http://www.codecoder.top/login", data=params)
# print the cookies returned by the server
for key, value in response.cookies.items():
    print('key = ', key + ' ||| value : ' + value)
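
Alternatively, requests can keep the login cookies for you: a requests.Session object stores the cookies from the login response and sends them with later requests automatically. A minimal sketch, reusing the hypothetical login endpoint above:

import requests

session = requests.Session()
# log in once; the session stores the cookies returned by the server
session.post("http://www.codecoder.top/login", data={'username': 'root', 'passwd': 'root'})
# later requests made through the same session carry the login cookies
response = session.get("http://www.codecoder.top/tag/crawler")
print(response.status_code)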

2. Logging in with a saved cookie

We can store the login cookie in a file and reuse it for later requests.

import urllib.request
import http.cookiejar

"""
Save the login cookie for the crawler.

MozillaCookieJar is a cookiejar subclass of FileCookieJar that creates a
FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
"""
cookie = http.cookiejar.MozillaCookieJar('cookie.txt')
# create a cookie handler for the crawler
handler = urllib.request.HTTPCookieProcessor(cookie)
# build an opener object for the crawler
opener = urllib.request.build_opener(handler)
# build a request object for the crawler
request = urllib.request.Request('http://www.codecoder.top/tag/crawler', headers={"Connection": "keep-alive"})
# send the request; the server's response carries the cookie
response = opener.open(request)
# save the cookie to a file for later use
cookie.save(ignore_discard=True, ignore_expires=True)
 
 
"""
request with file of cookie
"""
 
import urllib.request
import http.cookiejar
cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
request = urllib.request.Request('http://www.codecoder.top/tag/crawler/')
html = opener.open(request).read().decode('utf8')
 
print(html)

How to deal with anti-crawler restrictions?

1. Restriction by User-Agent

The User-Agent header lets the server identify the user's operating system and version, CPU type, and browser type and version. Many websites maintain a User-Agent whitelist, and only requests whose User-Agent is on the whitelist are served normally. So in our crawler code we need to set the User-Agent to disguise the request as a browser request. Sometimes the server also verifies the Referer, so you may need to set the Referer as well (it indicates which page the request was linked from).

# set the request headers of the web crawler
headers = {
        'Host': 'www.codecoder.top',
        'Referer': 'http://www.codecoder.top/tag/crawler/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
response = requests.get("http://www.codecoder.top/tag/crawler/", headers=headers)

The following is a sample of the information in a request header:

accept: */*
accept-encoding: gzip, deflate, br
accept-language: zh-CN,zh;q=0.9
content-length: 0
cookie: bdshare_firstime=1500xxxxxxxx..............
origin: http://www.codecoder.top
referer: http://www.codecoder.top/tag/crawler/
user-agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36
x-requested-with: XMLHttpRequest

2. Restriction by IP address

When we access a server many times from the same IP address, the server may detect that the requests come from a crawler and stop returning the normal page content.
The usual solution is to use an IP proxy pool; many websites on the Internet provide proxies.

import requests

# example proxy configuration; in practice you would rotate among several proxies
proxies = {
  "http": "http://119.101.125.56",
  "https": "http://119.101.125.1",
}
response = requests.get("http://www.google.com", proxies=proxies)

3. Restriction by request interval

Under this restriction, your web crawler must not send requests too often; you need to set an interval between your crawl actions.

import time
# wait one second between successive requests
time.sleep(1)

4. Use automated web application testing tools as a crawler

Automated web application testing tools such as Selenium can be used as a web crawler. Selenium is built for unit testing, integration testing, system testing, and more. It can drive the browser like a real user (filling in text, clicking the mouse, getting elements, switching pages) and supports Mozilla Firefox, Google Chrome, Safari, Opera, IE and other browsers.
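
A minimal Selenium sketch (Selenium 4 style, assuming chromedriver is installed; the CSS selector is hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By

# start a Chrome browser driven by Selenium (requires chromedriver)
driver = webdriver.Chrome()
driver.get('http://www.codecoder.top/tag/crawler')
# read the rendered page source and a hypothetical element by CSS selector
html = driver.page_source
title = driver.find_element(By.CSS_SELECTOR, 'h1').text
print(title)
driver.quit()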

5. Restriction by encrypted parameters

Some websites encrypt request parameters in JavaScript before sending them to the server in order to defeat crawlers. In that case we can try to crack the scheme by reading the site's JS code; the encryption function is somewhere in that code, and it just takes some time to find it.
Alternatively you can use a "headless" browser such as PhantomJS, a WebKit-based browser that loads the website into memory and executes the JavaScript on the page. Because it does not display a graphical interface, it runs more efficiently than a full browser.
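
PhantomJS is no longer actively developed, so as a sketch of the same idea, here is a headless Chrome setup via Selenium (assuming chromedriver is installed); it renders the page and runs its JavaScript without opening a window:

from selenium import webdriver

# run Chrome in headless mode: the page's JavaScript executes, but no window is shown
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('http://www.codecoder.top/tag/crawler')
print(driver.page_source)   # HTML after the page's JavaScript has run
driver.quit()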

6. Restriction through robots.txt

robots.txt is a convention for restricting crawlers; it declares which paths must not be crawled. If the file exists in a site's root directory, crawlers are expected to limit themselves to the range the file allows.

Sample contents of a robots.txt file:

User-agent: Baiduspider
Disallow: /product/
Disallow: /

User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: Yahoo! Slurp
Disallow: /

User-agent: *
Disallow: /

As you can see, this robots.txt shuts out the Baidu, Google, Bing and Yahoo crawlers, and the final rule (User-agent: *) disallows all other crawlers as well.

In this case, you should only crawl the pages that robots.txt allows; otherwise there may be legal risks.
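
A minimal sketch of checking robots.txt with the standard library's urllib.robotparser before fetching a URL (using the site from the earlier examples):

import urllib.robotparser

# download and parse the site's robots.txt
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.codecoder.top/robots.txt')
rp.read()

# check whether our crawler's user-agent may fetch a given URL
print(rp.can_fetch('*', 'http://www.codecoder.top/tag/crawler/'))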

Web crawler analysis

We can parse the content of the crawled web pages and extract the data we actually need. Common tools include regular expressions, BeautifulSoup, XPath, lxml, and so on.

Regular expressions do plain text matching: every piece of content that matches the pattern is extracted.
XPath converts the HTML string into element nodes and selects them with path expressions; it checks whether the content looks like markup tags, but not whether the content itself is genuine.
BeautifulSoup is a third-party Python library which, like XPath, is used to parse HTML data. By comparison, XPath via lxml is faster, because lxml is implemented in C. The sketch below applies all three to the same snippet.
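
A minimal comparison on a hypothetical HTML snippet (requires the beautifulsoup4 and lxml packages):

import re
from bs4 import BeautifulSoup
from lxml import etree

html = '<html><body><h1 class="title">Hello crawler</h1></body></html>'

# regular expression: extract the text between the <h1> tags
print(re.search(r'<h1[^>]*>(.*?)</h1>', html).group(1))

# BeautifulSoup: parse the document and navigate by tag and attribute
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1', class_='title').text)

# lxml + XPath: select the node with a path expression
tree = etree.HTML(html)
print(tree.xpath('//h1[@class="title"]/text()')[0])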

Web crawler storage

After parsing the pages and obtaining the data we want, we can save it to a text file or store it in a database. The most commonly used databases are MySQL and MongoDB.

Store crawled data as a JSON file

import json

dictObj = {
    'Jack': {
        'age': 15,
        'city': 'DC',
    },
    'Tom': {
        'age': 16,
        'city': 'NYK',
    }
}

# serialize the dict to a JSON string and write it to a file
jsObj = json.dumps(dictObj, ensure_ascii=False)
with open('jsonFile.json', 'w', encoding='utf-8') as fileObject:
    fileObject.write(jsObj)

Store crawled data as a CSV file

import csv
with open('student.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Age', 'City'])
    writer.writerows([['Jack', 15 , 'DC'],['Tom', 16, 'NYK']])

Store crawled data in MongoDB

import pymongo

# connect to the local MongoDB service
client = pymongo.MongoClient('mongodb://127.0.0.1:27017/')
# the test database
db = client.test
# the student collection
student_db = db.student
student_json = {
    'name': 'Jack',
    'age': 15,
    'city': 'DC'
}
student_db.insert_one(student_json)

Conclusion

The above is only a quick tutorial with example code, but it covers the range of techniques a web crawler needs to master and the common ways of dealing with anti-crawler measures. What you need to do is master the techniques mentioned above; with enough practice, you will naturally be able to implement crawlers that automatically fetch the information you need. For the specifics of the Python language itself, please refer to a Python tutorial.