
Overcoming Proxy Restrictions for Web Crawlers in Python

This article shows how to configure a ProxyHandler for urllib2 so that a crawler running behind a corporate proxy can reach external websites, and then how to adapt the code to download images from a forum and save them locally.

Anyone who has written a web crawler knows how convenient urllib2 (Python 2) is. A few lines of code are enough to fetch the source of a web page:

```python
#coding=utf-8
import urllib2

url = "http://wetest.qq.com"
request = urllib2.Request(url)
# Open the Request object itself so that anything set on it is actually used
page = urllib2.urlopen(request)
html = page.read()
print html
```

You can then extract the information you need by matching regular expressions against the returned content.
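
As a small illustration of that parsing step, the sketch below pulls link targets out of an HTML string with a regular expression. The sample markup and the `extract_links` helper are made up for this example, and it is written for Python 3. A real page may need a more tolerant pattern or a proper HTML parser.

```python
import re

# Hypothetical sample markup standing in for the downloaded page source.
SAMPLE_HTML = '''
<html><body>
<a href="http://example.com/a">First</a>
<a href="http://example.com/b">Second</a>
</body></html>
'''

# Non-greedy capture of everything between href=" and the next double quote.
LINK_PATTERN = re.compile(r'href="(.+?)"')


def extract_links(html):
    """Return every href value found in html, in document order."""
    return LINK_PATTERN.findall(html)


print(extract_links(SAMPLE_HTML))
# → ['http://example.com/a', 'http://example.com/b']
```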

However, this approach can fail for some external websites when the script runs inside an office or development network.

For example, accessing http://tieba.baidu.com/p/2460150866 fails with error 10060, the Windows socket error for a connection attempt that timed out.

```python
#coding=utf-8
import urllib2

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()
print html
```

Execution fails with the 10060 error instead of printing the page.
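
A failure like this can also be caught and inspected in code rather than read off a traceback. The sketch below is written for Python 3, where the functionality of urllib2 lives in `urllib.request` and `urllib.error`. The helper names `describe_failure` and `fetch` are made up for illustration, and the wording of the diagnosis is only a guess at the likely cause.

```python
import errno
import socket
import urllib.error
import urllib.request

WSAETIMEDOUT = 10060  # Windows socket error code for "connection timed out"


def describe_failure(exc):
    """Turn a URLError into a short human-readable diagnosis."""
    reason = exc.reason
    code = getattr(reason, "errno", None)
    if isinstance(reason, socket.timeout) or code in (WSAETIMEDOUT, errno.ETIMEDOUT):
        return "connection timed out (possibly blocked; try a proxy)"
    return "request failed: %s" % reason


def fetch(url, timeout=10):
    """Return the page body, or None after printing a diagnosis."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as page:
            return page.read()
    except urllib.error.URLError as exc:
        print(describe_failure(exc))
        return None
```

Calling `fetch("http://tieba.baidu.com/p/2460150866")` from a restricted network would then print the diagnosis and return `None` instead of raising.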

Two checks narrowed down the cause:

1. Entering the URL in a browser opens the page normally, so the site itself is reachable.

2. Running the same script on the company's experience network works, so the script itself is not at fault.

These two checks suggest that the company's access policy for external websites is responsible, presumably because the browser is configured to use the corporate proxy while the script tries to connect directly. I therefore looked up how to set a ProxyHandler for urllib2 and modified the code as follows:

```python
#coding=utf-8
import urllib2

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()
print html
```

After running the modified code, the desired HTML page can be obtained.
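
For readers on Python 3, where urllib2 was merged into `urllib.request`, roughly the same setup might look like the sketch below. The `build_proxy_opener` helper is a hypothetical name, and the proxy host and port are the ones used in this article. The sketch keeps the opener local instead of installing it globally, which is a design choice made here and not something the original code does.

```python
import urllib.request

# Proxy host and port copied from the article; substitute your own.
PROXY_INFO = {"host": "web-proxy.oa.com", "port": 8080}


def build_proxy_opener(proxy_info):
    """Build an opener that routes http requests through the given proxy."""
    proxy_url = "http://%(host)s:%(port)d" % proxy_info
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxy_opener(PROXY_INFO)
# opener.open(url) uses the proxy without changing the module-wide default:
# html = opener.open("http://tieba.baidu.com/p/2460150866").read()
```

Passing the opener around explicitly makes it obvious which requests go through the proxy, at the cost of having to call `opener.open` everywhere instead of `urlopen`.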

Are we done? Not yet. The goal is to download the images from the forum thread and save them locally. Here is the first attempt:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Match image URLs that end in .jpg and are followed by a pic_ext attribute
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = imgre.findall(html)
print 'start download pic'
# Save each image as pic\0.jpg, pic\1.jpg, ... (the pic directory must already exist)
for x, imgurl in enumerate(imglist):
    urllib.urlretrieve(imgurl, 'pic\\%s.jpg' % x)
```

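Running this version still fails, again with error 10060. The proxy is set for urllib2, so why does the error persist? The reason is that urllib.urlretrieve belongs to the separate urllib module, which does not use the opener installed through urllib2.install_opener, so the image requests bypass the proxy.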

The regular expression already yields the image URLs, so instead of relying on urllib.urlretrieve we can call urllib2.urlopen on each URL ourselves, read the binary image data from the response, and write it to a local file. This led to the following code:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = { 'host' : 'web-proxy.oa.com','port' : 8080 }

# We create a handler for the proxy
proxy_support = urllib2.ProxyHandler({"http" : "http://%(host)s:%(port)d" % proxy_info})

# We create an opener which uses this handler:
opener = urllib2.build_opener(proxy_support)

# Then we install this opener as the default opener for urllib2:
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Match image URLs that end in .jpg and are followed by a pic_ext attribute
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = imgre.findall(html)

print 'start'
for x, imgurl in enumerate(imglist):
    print imgurl
    # urllib2.urlopen goes through the installed proxy opener
    resp = urllib2.urlopen(imgurl)
    image_data = resp.read()
    # Write the raw bytes in binary mode; the with block closes the file
    with open('%s.jpg' % x, "wb") as picFile:
        picFile.write(image_data)
print 'done'
```

Running this version prints the image URLs as expected and saves the images locally.
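
The extract-and-save steps can also be separated into small functions, which makes them testable without a network. The sketch below is written for Python 3 and reuses the article's regular expression. The helper names, the sample markup, and the fake fetcher are all invented for illustration.

```python
import os
import re
import tempfile

# Pattern from the article: .jpg URLs that are followed by a pic_ext attribute.
IMAGE_PATTERN = re.compile(r'src="(.+?\.jpg)" pic_ext')


def extract_image_urls(html):
    """Return the image URLs matched by IMAGE_PATTERN, in page order."""
    return IMAGE_PATTERN.findall(html)


def save_images(urls, fetch, out_dir):
    """Fetch each URL with fetch(url) -> bytes and save it as 0.jpg, 1.jpg, ..."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(urls):
        path = os.path.join(out_dir, "%d.jpg" % index)
        with open(path, "wb") as pic_file:
            pic_file.write(fetch(url))
        paths.append(path)
    return paths


# Hypothetical markup and a fake fetcher, so the sketch runs offline.
sample_html = (
    '<img src="http://example.com/1.jpg" pic_ext="jpeg">'
    '<img src="http://example.com/2.jpg" pic_ext="jpeg">'
)
urls = extract_image_urls(sample_html)
saved = save_images(urls, lambda url: b"fake image bytes", tempfile.mkdtemp())
print(urls)
# → ['http://example.com/1.jpg', 'http://example.com/2.jpg']
```

In real use, `fetch` would be a function that opens the URL through the proxy-aware opener and returns the response bytes, and the page source would need decoding from bytes to text before matching.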

With that, the original goal has been achieved. I hope this summary is useful to others.
