A new weapon for crawlers: a look inside GitHub's popular open-source IP proxy pool!

2024.03.23




In web crawling, vulnerability hunting, and penetration testing, we often run into requests being intercepted, which interrupts the task. Proxy pool technology arose to keep those requests flowing. A proxy pool works like a magical "pool" of addresses: query it and it hands you a proxy IP. If your current IP gets blocked, don't worry, just switch to another proxy IP and carry on with your requests. Today, let's walk through installing and using an IP proxy pool and pick up a few tricks for hiding your IP!

Recently I came across an excellent project on GitHub: a free proxy pool tool called proxy_pool[1]. The project is fully open source and is actively maintained by its developers.

Project Introduction

The proxy_pool project is written in Python and implements the following main features:

  • Periodically crawls free proxy websites, and is easy to extend with new sources.
  • Uses Redis to store proxies and ranks them by availability.
  • Tests proxies on a schedule, removing unavailable ones and keeping those that work.
  • Provides a proxy API for randomly fetching tested, usable proxies.
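The availability-based ranking described above can be sketched in plain Python. The real project keeps these scores in Redis; the class below is only an in-memory illustration, and the score constants are assumptions, not the project's exact values:

```python
# Illustrative sketch of score-based proxy bookkeeping. The real project
# stores scores in Redis; this in-memory version only shows the idea.
SCORE_INIT, SCORE_MAX, SCORE_MIN = 10, 100, 0

class ProxyScores:
    def __init__(self):
        self.scores = {}

    def add(self, proxy):
        # New proxies start with a provisional score.
        self.scores.setdefault(proxy, SCORE_INIT)

    def mark_valid(self, proxy):
        # A proxy that passes testing is promoted to the top score.
        self.scores[proxy] = SCORE_MAX

    def mark_invalid(self, proxy):
        # Each failed test decrements the score; at the minimum
        # the proxy is removed from the pool entirely.
        if proxy not in self.scores:
            return
        self.scores[proxy] -= 1
        if self.scores[proxy] <= SCORE_MIN:
            del self.scores[proxy]

    def best(self):
        # Prefer the highest-scoring (most recently validated) proxy.
        return max(self.scores, key=self.scores.get) if self.scores else None
```

The tester component then only has to call `mark_valid` or `mark_invalid` after each check, and stale proxies age out on their own.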

At the time of writing, the project has earned 5.3K GitHub stars and has received widespread attention and recognition.

Deployment method

You can run a proxy pool in two ways: with Docker (recommended) or the conventional way. The requirements for each are as follows:

1. Docker

If you use Docker, you need to install the following environment:

  • Docker
  • Docker-Compose

Installation instructions for both are easy to find online. The official Docker Hub image is germey/proxypool[2].

2. Conventional way

The conventional method requires a Python environment and a Redis environment. The specific requirements are as follows:

  • Python>=3.6
  • Redis
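Under the conventional approach, the typical flow is to clone the repository, install its dependencies, and start the service. This is only a sketch: the file names below (requirements.txt, run.py) are assumptions about the repository layout, so check the project's README for the authoritative steps.

```shell
# Hypothetical walkthrough of a conventional (non-Docker) deployment.
# File names are assumptions about the repo layout; see the README.
git clone https://github.com/Python3WebSpider/ProxyPool.git
cd ProxyPool
pip3 install -r requirements.txt
# Make sure a local Redis server is running, then start the pool:
python3 run.py
```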

Running with Docker

If Docker and Docker-Compose are already installed, you can start the pool with a single command.

docker-compose up

The running results are similar to the following:

redis        | 1:M 19 Feb 2020 17:09:43.940 * DB loaded from disk: 0.000 seconds
redis        | 1:M 19 Feb 2020 17:09:43.940 * Ready to accept connections
proxypool    | 2020-02-19 17:09:44,200 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
proxypool    | 2020-02-19 17:09:44,203 INFO supervisord started with pid 1
proxypool    | 2020-02-19 17:09:45,209 INFO spawned: 'getter' with pid 10
proxypool    | 2020-02-19 17:09:45,212 INFO spawned: 'server' with pid 11
proxypool    | 2020-02-19 17:09:45,216 INFO spawned: 'tester' with pid 12
proxypool    | 2020-02-19 17:09:46,596 INFO success: getter entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: server entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
proxypool    | 2020-02-19 17:09:46,596 INFO success: tester entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

You can see that Redis, the Getter, the Server, and the Tester have all started successfully. Now visit http://localhost:5555/random to obtain a random available proxy.

Of course, you can also build the image yourself by running:

docker-compose -f build.yaml up

Usage

Once the pool is running, you can fetch a random available proxy from http://localhost:5555/random.

You can also integrate it into your own programs. The following example shows how to fetch a proxy from the pool and use it to crawl a web page:

import requests

proxypool_url = 'http://127.0.0.1:5555/random'
target_url = 'http://httpbin.org/get'

def get_random_proxy():
    """
    get random proxy from proxypool
    :return: proxy
    """
    return requests.get(proxypool_url).text.strip()

def crawl(url, proxy):
    """
    use proxy to crawl page
    :param url: page url
    :param proxy: proxy, such as 8.8.8.8:8888
    :return: html
    """
    proxies = {'http': 'http://' + proxy}
    return requests.get(url, proxies=proxies).text


def main():
    """
    main method, entry point
    :return: none
    """
    proxy = get_random_proxy()
    print('get random proxy', proxy)
    html = crawl(target_url, proxy)
    print(html)

if __name__ == '__main__':
    main()
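One practical extension, not shown in the project's example above: if a proxy from the pool turns out to be dead, simply fetch another one and retry. The helper below is an illustrative sketch; the function names are arbitrary, and it is written to accept the `get_random_proxy` and `crawl` functions from the script above as arguments:

```python
# Illustrative retry helper: if fetching through one proxy fails, pull a
# fresh proxy from the pool and try again. `get_proxy` and `crawl` are
# passed in so the logic stays independent of any particular HTTP client.
def crawl_with_retry(url, get_proxy, crawl, retries=3):
    last_error = None
    for _ in range(retries):
        proxy = get_proxy()
        try:
            return crawl(url, proxy)
        except Exception as exc:  # a dead proxy typically raises a connection error
            last_error = exc
    # All proxies failed: surface the last error to the caller.
    raise last_error
```

With the script above, `crawl_with_retry(target_url, get_random_proxy, crawl)` would keep trying new proxies until one succeeds or the retry budget is exhausted.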

The running results are as follows:

get random proxy 116.196.115.209:8080
{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.22.0",
    "X-Amzn-Trace-Id": "Root=1-5e4d7140-662d9053c0a2e513c7278364"
  },
  "origin": "116.196.115.209",
  "url": "https://httpbin.org/get"
}

You can see that a proxy was successfully obtained, and the request to httpbin.org confirms it works: the "origin" field in the response matches the proxy's IP address.
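The kind of availability check the Tester performs can be approximated with a small helper. This is a simplified sketch, not the project's actual test logic; the test URL and timeout are arbitrary choices:

```python
import requests

def is_proxy_usable(proxy, test_url='http://httpbin.org/get', timeout=5):
    """Return True if `proxy` (e.g. '8.8.8.8:8888') can fetch the test URL.

    Simplified sketch of an availability check; not the project's real tester.
    """
    try:
        response = requests.get(
            test_url,
            proxies={'http': 'http://' + proxy},
            timeout=timeout,
        )
        # Treat any successful 200 response as "usable".
        return response.status_code == 200
    except requests.RequestException:
        # Timeouts and connection errors mean the proxy is dead.
        return False
```

Running such a check on a schedule, and pruning proxies that fail repeatedly, is essentially what keeps the pool's `/random` endpoint returning working addresses.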

Final thoughts

In network data collection and security testing, a proxy pool is a very useful tool: it helps you manage and use proxy resources effectively and improves your efficiency. I hope this article has given you a deeper understanding of how proxy pools work and what they are for. Whether you are a developer or a security engineer, mastering proxy pools will give you a real weapon for your work.

References:

  • [1] proxy_pool: https://github.com/Python3WebSpider/ProxyPool
  • [2] germey/proxypool: https://hub.docker.com/r/germey/proxypool