
Overcoming Proxy Restrictions for Web Crawlers in Python

In this article, we explore how to set up a ProxyHandler proxy for urllib2 to get past a corporate network's restrictions on external websites, and how to adjust the download code so that images from a forum can be fetched through the same proxy and saved locally.

Anyone who has written a web crawler knows that Python's urllib2 (Python 2) is convenient to use. A few lines of code are enough to fetch the source of a web page:

```python
#coding=utf-8
import urllib2

url = "http://wetest.qq.com"
request = urllib2.Request(url)
# Open the Request object itself so that any headers set on it are sent
page = urllib2.urlopen(request)
html = page.read()
print html
```

You can then extract the desired information by matching the returned content against regular expressions.

However, this approach may fail for some external websites when the script runs inside an office or development network.

For example, trying to access http://tieba.baidu.com/p/2460150866 fails with error code 10060, the Windows socket error for a connection attempt that timed out.
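As a rough illustration of that matching step, here is a minimal sketch that runs offline. The HTML fragment, the `example.com` URLs, and the pattern are made up for the example and are not taken from the site above.

```python
import re

# Hypothetical HTML fragment standing in for a downloaded page.
html = '''
<a href="/about">About</a>
<img src="http://example.com/a.jpg" alt="first">
<img src="http://example.com/b.png" alt="second">
'''

# Non-greedy capture of the text between src=" and the next double quote,
# restricted to <img> tags.
img_pattern = re.compile(r'<img[^>]*\bsrc="(.+?)"')
img_urls = img_pattern.findall(html)
print(img_urls)
```

A pattern like this is tied to the exact markup of one page, so it usually needs adjusting whenever the page layout changes.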

```python
#coding=utf-8
import urllib2

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
# Same request as before, now against an external forum page
page = urllib2.urlopen(request)
html = page.read()
print html
```

Running this script fails with the 10060 connection error.

To analyze the cause of this issue, the following checks were made:

1. The URL opens normally in a browser, so the site itself is reachable.

2. The same script runs fine on the company's experience network, so the script itself is not at fault.

Based on these two checks, the likely cause is the company's access policy for external websites. I therefore looked up how to set a ProxyHandler proxy for urllib2 and modified the code as follows:

```python
#coding=utf-8
import urllib2

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# Create a handler that sends HTTP requests through the proxy
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# Create an opener which uses this handler
opener = urllib2.build_opener(proxy_support)

# Install this opener as the default opener for urllib2.urlopen
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()
print html
```

After running the modified code, the desired HTML page can be obtained.
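The proxy setup can also be wrapped in a small helper that returns an opener instead of installing it globally. The sketch below is one possible shape, not the article's original code: `build_proxy_opener` is a hypothetical name, and mapping `https` to the same proxy assumes that the proxy also accepts HTTPS traffic. The import fallback lets the same code run on Python 3, where urllib2 became `urllib.request`.

```python
try:
    import urllib2 as urlrequest          # Python 2, as used in this article
except ImportError:
    import urllib.request as urlrequest   # Python 3 equivalent


def build_proxy_opener(host, port):
    """Return an opener that routes HTTP and HTTPS requests through host:port."""
    proxy_url = "http://%s:%d" % (host, port)
    # Assumption: the same proxy is used for both schemes.
    handler = urlrequest.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urlrequest.build_opener(handler)


# Proxy address taken from the article; replace it with your own.
opener = build_proxy_opener("web-proxy.oa.com", 8080)
# opener.open("http://tieba.baidu.com/p/2460150866").read() would fetch the page.
```

Calling `opener.open(...)` directly keeps the proxy choice explicit at each call site and leaves the process-wide default opener untouched.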

Is it over now? Not yet! The goal is to obtain various beautiful images from the forum and save them locally. Let's move on to the code:

```python
#coding=utf-8
import urllib
import urllib2
import re

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# Create a handler that sends HTTP requests through the proxy
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# Create an opener which uses this handler
opener = urllib2.build_opener(proxy_support)

# Install this opener as the default opener for urllib2.urlopen
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Match the URLs of the .jpg images embedded in the page
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = imgre.findall(html)
print 'start download pic'
# Save each image as pic\0.jpg, pic\1.jpg, ... (the pic directory must already exist)
for x, imgurl in enumerate(imglist):
    urllib.urlretrieve(imgurl, 'pic\\%s.jpg' % x)
```

After running the code again, an error still occurs! It's the 10060 error again. I've set the proxy for urllib2, so why is there still an error? The likely reason is that urllib.urlretrieve belongs to the older urllib module, which does not go through the opener installed with urllib2.install_opener, so the image requests never reach the proxy.

So, I continued to look for a solution, determined to obtain various beautiful images from the forum. The regular expression already yields the image URLs, so why not call urllib2.urlopen on each URL directly, read the binary data of the image from the response, and write it to a local file? This led to the following code:
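As a diagnostic sketch: in Python 2, `urllib.urlretrieve` finds its proxies through `getproxies()`, which reads environment variables such as `http_proxy` (or system settings), not through anything installed on urllib2. The snippet below shows that lookup. Setting `http_proxy` inside the script is an assumption made for illustration and is not something the original code did.

```python
import os

try:
    from urllib import getproxies            # Python 2
except ImportError:
    from urllib.request import getproxies    # Python 3

# Assumption: the proxy is exposed to this process through an environment variable.
os.environ["http_proxy"] = "http://web-proxy.oa.com:8080"

# Mapping of scheme -> proxy URL that the urllib-style lookup would use.
detected = getproxies()
print(detected.get("http"))
```

If this lookup comes back without an `http` entry on the affected machine, that would be consistent with `urllib.urlretrieve` trying to connect directly and timing out.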

```python
#coding=utf-8
import urllib2
import re

# The proxy address and port:
proxy_info = {'host': 'web-proxy.oa.com', 'port': 8080}

# Create a handler that sends HTTP requests through the proxy
proxy_support = urllib2.ProxyHandler({"http": "http://%(host)s:%(port)d" % proxy_info})

# Create an opener which uses this handler
opener = urllib2.build_opener(proxy_support)

# Install this opener as the default opener for urllib2.urlopen
urllib2.install_opener(opener)

url = "http://tieba.baidu.com/p/2460150866"
request = urllib2.Request(url)
page = urllib2.urlopen(request)
html = page.read()

# Match the URLs of the .jpg images embedded in the page
reg = r'src="(.+?\.jpg)" pic_ext'
imgre = re.compile(reg)
imglist = imgre.findall(html)

print 'start'
for x, imgurl in enumerate(imglist):
    print imgurl
    # urllib2.urlopen uses the installed opener, so this request goes through the proxy
    resp = urllib2.urlopen(imgurl)
    img_data = resp.read()
    # Write the raw image bytes to 0.jpg, 1.jpg, ... in the current directory
    with open('%s.jpg' % x, "wb") as pic_file:
        pic_file.write(img_data)
print 'done'
```

After running the code again, the image URLs were printed as expected, and the images were saved.

At this point, the original goal has been achieved. I hope this summary is useful to others as well.
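The fetch-and-save step can be pulled out into a small function, which also makes it testable without network access. The sketch below is a hedged variation on the loop above: `save_image` and `FakeOpener` are hypothetical names, the `timeout` value is an arbitrary choice, and the stub opener stands in for a real urllib2 opener so the example runs offline.

```python
import contextlib
import io
import os
import tempfile


def save_image(opener, url, path, timeout=10):
    """Fetch url through opener and write the raw bytes to path."""
    # closing() also works with Python 2 responses, which are not context managers.
    with contextlib.closing(opener.open(url, timeout=timeout)) as resp:
        data = resp.read()
    with open(path, "wb") as pic_file:
        pic_file.write(data)
    return len(data)


class FakeOpener(object):
    """Hypothetical stand-in for a proxy-enabled opener, used to run offline."""

    def open(self, url, timeout=None):
        return io.BytesIO(b"fake-jpeg-bytes")


out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "0.jpg")
size = save_image(FakeOpener(), "http://example.com/0.jpg", path)
print(size)
```

In real use, the object returned by `urllib2.build_opener(proxy_support)` would be passed in place of `FakeOpener()`.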
