Overcoming Proxy Restrictions for Web Crawlers in Python

In this article, we explore how to set up a ProxyHandler proxy for urllib2 to get past a corporate network's restrictions on external websites, and how to modify the crawler code so that it can download images from a forum and save them locally.

Anyone who has worked on web crawlers knows how convenient urllib2 (part of the Python 2 standard library) is to use. With just a few lines of code, you can obtain the source code of a web page:

```python
# coding=utf-8
import urllib2

url = "http://wetest.qq.com"
request = urllib2.Request(url)
# Pass the Request object (not the bare URL) so the request built above is used.
page = urllib2.urlopen(request)
html = page.read()
print html
```

Finally, you can get the desired information by using certain regular expressions to match and parse the returned response content.
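As a minimal sketch of that parsing step, the snippet below pulls the page title and link targets out of a small HTML string with `re`. It is written for Python 3, where the response body is `bytes` and has to be decoded before matching. The sample HTML, the helper names `extract_title` and `extract_links`, and the UTF-8 encoding are assumptions made for illustration, not part of the original script.

```python
import re

# Hypothetical response body; in Python 3, page.read() returns bytes like this.
SAMPLE_BODY = (
    b"<html><head><title>Example Page</title></head>"
    b'<body><a href="http://example.com/a">A</a>'
    b'<a href="http://example.com/b">B</a></body></html>'
)

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
LINK_RE = re.compile(r'href="([^"]+)"')


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return every href value found in the document, in order."""
    return LINK_RE.findall(html)


# Decode the bytes first (UTF-8 is assumed here), then match on the text.
html = SAMPLE_BODY.decode("utf-8")
print(extract_title(html))
print(extract_links(html))
```

Regular expressions are adequate for a fixed, known page layout like this one; for pages whose markup varies, an HTML parser is usually the more robust choice.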

However, on office and development networks this approach may fail for some external websites.

For example, trying to access http://tieba.baidu.com/p/2460150866 reports error 10060, the Windows socket error for a connection attempt that timed out.

```python
# coding=utf-8
import urllib2

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
# On the restricted network, this call fails with error 10060.
page = urllib2.urlopen(request)
html = page.read()
print html
```

Running the script fails with the error described above. To narrow down the cause, I made two checks:

1. The URL opens normally in a browser, so the site itself is reachable.

2. The same script runs fine on the company's experience network, so the script itself is not at fault.

Based on these two checks, the likely cause is the company's access policy for external websites. I therefore looked up how to set a ProxyHandler proxy for urllib2 and modified the code as follows:

```python
# coding=utf-8
import urllib2

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# Create a handler for the proxy:
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# Create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
# urllib2.urlopen now goes through the installed proxy opener.
page = urllib2.urlopen(request)
html = page.read()
print html
```

After running the modified code, the desired HTML page can be obtained.
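For readers on Python 3, where `urllib2` was merged into `urllib.request`, the same setup can be sketched as follows. This is a minimal sketch rather than the original script: `make_proxy_opener` is a hypothetical helper name, the `https` entry is an added assumption that the same proxy also serves HTTPS traffic, and nothing is sent over the network until `opener.open()` is called.

```python
import urllib.request


def make_proxy_opener(host, port):
    """Build an opener that routes HTTP and HTTPS requests through host:port."""
    proxy_url = "http://%s:%d" % (host, port)
    # Assumption: the same proxy handles both schemes.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Proxy address and port taken from the script above.
opener = make_proxy_opener("web-proxy.oa.com", 8080)

# Option 1: use the opener explicitly and leave the module-wide default alone.
# page = opener.open("http://tieba.baidu.com/p/2460150866", timeout=10)

# Option 2: install it so that every urllib.request.urlopen() call uses it.
# urllib.request.install_opener(opener)
```

Using the opener explicitly keeps the proxy choice visible at each call site, while installing it changes the behavior of every later `urlopen()` call in the process.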

Is it over now? Not yet! The goal is to obtain various beautiful images from the forum and save them locally. Let's move on to the code:

```python
# coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# Create a handler for the proxy:
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# Create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Match the image URLs in the page source:
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = imgre.findall(html)
print 'start download pic'
x = 0
for imgurl in imglist:
    # Save each image as pic\0.jpg, pic\1.jpg, ... (the pic directory must already exist).
    urllib.urlretrieve(imgurl, 'pic\\%s.jpg' % x)
    x = x + 1
```

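Running this version still fails, again with error 10060, even though a proxy has been set for urllib2. The reason is that urllib2.install_opener only affects requests made through urllib2.urlopen. The images are fetched with urllib.urlretrieve, which belongs to the separate urllib module and does not use the opener installed for urllib2, so those requests bypass the configured proxy.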

So I kept looking for a solution. Since the regular expression already yields the image URLs, why not call urllib2.urlopen on each URL, read the binary image data from the response, and write it to a local file? This led to the following code:

```python
# coding=utf-8
import urllib2
import re

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# Create a handler for the proxy:
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# Create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Match the image URLs in the page source:
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = imgre.findall(html)

x = 0
print 'start'
for imgurl in imglist:
    print imgurl
    # urllib2.urlopen goes through the installed proxy opener.
    resp = urllib2.urlopen(imgurl)
    img_data = resp.read()
    # Write the binary image data to 0.jpg, 1.jpg, ... in the current directory.
    with open('%s.jpg' % x, "wb") as pic_file:
        pic_file.write(img_data)
    x = x + 1
print 'done'
```

After running the code again, it was found that the image URLs were printed as expected, and the images were also saved.
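The download loop can also be written so that the opener is passed around explicitly instead of being installed globally. The following is a sketch under assumptions, not the author's code: it targets Python 3 (`urllib.request`), `save_url` and `save_all` are hypothetical helper names, and the 10-second timeout and the `pic` output directory in the usage comment are arbitrary choices.

```python
import os
import shutil
import urllib.request


def save_url(opener, url, path, timeout=10):
    """Fetch url through the given opener and stream the body to path."""
    with opener.open(url, timeout=timeout) as resp, open(path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return path


def save_all(opener, urls, out_dir):
    """Save each URL as 0.jpg, 1.jpg, ... in out_dir and return the paths written."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        target = os.path.join(out_dir, "%d.jpg" % index)
        saved.append(save_url(opener, url, target))
    return saved


# Usage (needs network access and a reachable proxy, so it is left commented out):
# proxy = urllib.request.ProxyHandler({"http": "http://web-proxy.oa.com:8080"})
# opener = urllib.request.build_opener(proxy)
# save_all(opener, imglist, "pic")
```

Because every request goes through `opener.open()`, no download can silently bypass the proxy. Streaming with `shutil.copyfileobj` also avoids holding a whole image in memory.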

At this point, the original goal has been achieved. I hope this summary is useful to others who run into the same problem.
