Overcoming Proxy Restrictions for Web Crawlers in Python

Testing Tech Hub 2023-10-31 12:28 280

In this article, we explore how to set up a ProxyHandler proxy for urllib2 and modify the code to overcome these restrictions, allowing us to successfully download images from a forum and save them locally.

Those who have worked on web crawlers would know that Python's urllib2 is very convenient to use. With just a few lines of code, you can easily obtain the source code of a website:

#coding=utf-8
import urllib
import urllib2
import re

url = "http://wetest.qq.com"
request = urllib2.Request(url)
page = urllib2.urlopen(url)
html = page.read()
print html

Finally, you can get the desired information by using certain regular expressions to match and parse the returned response content.

However, this method may not work for some external websites in office and development networks.

For example, when trying to access http://tieba.baidu.com/p/2460150866, a 10060 error code is reported, indicating a connection failure.

#coding=utf-8
import urllib
import urllib2
import re

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(url)
html = page.read()
print html

After execution, the error message screenshot is as follows:

To analyze the cause of this issue, the following steps were taken:

1. Entering the URL in a browser can open the website normally, indicating that the site is accessible.

2. Running the same script on the company's experience network works fine, indicating that there is no problem with the script itself.

Based on these two steps, it is preliminarily determined that the issue is caused by the company's access policy restrictions for external websites. Therefore, I looked up how to set a ProxyHandler proxy for urllib2 and modified the code as follows:

#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(url)
html = page.read()
print html

After running the modified code, the desired HTML page can be obtained.

Is it over now? Not yet! The goal is to obtain various beautiful images from the forum and save them locally. Let's move on to the code:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(url)
html = page.read()

# Regular expression matching
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre,html)
print 'start dowload pic'
x = 0
for imgurl in imglist:
urllib.urlretrieve(imgurl,'pic\\%s.jpg' % x)
x = x+1
```

After running the code again, an error still occurs! It's the 10060 error again. I've set the proxy for urllib2, so why is there still an error?

So, I continued to find a solution, determined to obtain various beautiful images from the forum. Since regular expressions can be used to obtain the URLs of images in the forum, why not manually call urllib2.urlopen to open the corresponding URL, obtain the corresponding response, then read the binary data of the image, and finally save the image to a local file? This led to the following code:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(url)
html = page.read()

# Regular expression matching
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = re.findall(imgre,html)

x = 0
print 'start'
for imgurl in imglist:
print imgurl
resp = urllib2.urlopen(imgurl)
respHtml = resp.read()
picFile = open('%s.jpg' % x, "wb")
picFile.write(respHtml)
picFile.close()
x = x+1
print 'done'
```

After running the code again, it was found that the image URLs were printed as expected, and the images were also saved.

At this point, the original goal has been achieved. I hope the summarized content is also useful for other friends.

python

Read Previous Post >>

Code Refactoring in Practice: Enhancing Maintainability and Scalability

Overcoming Proxy Restrictions for Web Crawlers in Python

Related Content