Campus network keeps dropping? Use a crawler to reconnect!

This post is about JS decryption. With packing and anti-crawling techniques growing ever more varied, many websites, especially those holding commercial data, are really hard to deal with.


Preface

Hello everyone, I am Dasai. It has been a long time, and I have really missed you all.

 

Recently, because of a few small needs, I studied the encryption used by two login pages and successfully worked out how their encrypted parameters are produced. I will share the process with you here.

Some time ago, a classmate's lab server kept dropping off the campus network, and I was asked whether there was any way to reconnect automatically after a disconnection.


I was busy at the time and had not done this kind of thing for a long while, so I did not look into it then. Yesterday, when I had some free time, I studied and analyzed it. The process may be a piece of cake for people with some background, but if you have never seen it before, it is worth going through once. You may need it later, and the next time you meet something similar it will feel simple.

 

This content belongs to the advanced side of crawling: JS decryption. Of course, with all the different packing and anti-crawling techniques, countermeasures are becoming more and more sophisticated, and many websites, especially those holding commercial data, are really hard to tackle.

 

Most websites require authentication, and many pages and operations can only be accessed after it. Logging in is the key step of authentication: to move freely through a site, the user usually has to log in first. Login is therefore the first problem many crawlers have to solve, and in many cases it is also the most complicated and difficult part of the whole crawler. Only once login works can the program do anything else.


Logins come in two kinds. In the first, the credentials are sent in plain text. This is rare in practice, although it is how most of us implemented login as students. Plain text is not safe, so in the second and far more common kind, the login or request encrypts some of its parameters. To simulate such a login with a program, we need to understand how each parameter is produced so that we can generate and send it ourselves. Of course, the hardest part of login is really the CAPTCHA. Given my limited ability I have not studied it and will not cover it here. Most websites do not show a CAPTCHA while the number of failed attempts is low, so this approach still works in most scenarios.

 

The campus network only grants Internet access after a successful login over HTTP, and the password parameter of that login is encrypted. Below I share a short analysis based on my own environment.


Analysis

That is enough introduction. Let's get straight to the topic and start analyzing the problem.

 

On campus we connect through the campus Wi-Fi or a network cable, so we sit inside the campus LAN, and outbound traffic costs money. Without authentication and authorization you cannot reach external services; you can access the Internet only after successfully logging in to the campus network platform.

There are many ways to implement login, so the first step is to observe how this one behaves. I roughly divide logins into two types: ordinary form login and dynamic Ajax login.

 

How to distinguish between the two?

 

It is very simple: check whether the URL changes when you log in.



On this school's campus network login page, the URL does not change after login, so this is an Ajax login.

 

Is there much difference between the two? Not really, but an Ajax login generally does not require a dedicated capture tool. A form login may involve several redirects and page loads, which makes it hard to catch the relevant requests in the browser, so you may need tools such as Fiddler or Wireshark to capture the packets.

 

First, press F12 to open the browser's developer tools, switch to the Network tab, and select the XHR filter. The All filter shows too much, although in rare cases a small part of the data may be hidden in the JavaScript files. The Doc filter generally shows the main page; for an ordinary form login, that is the request to look at.



After clicking login, you can see each request that was exchanged. On this page there is a login request, and just before it a getchallenge request. Click the login request first to view the parameters it carries.



The request carries three parameters: username, password, and an unfamiliar challenge. The getchallenge request above it, when inspected, turns out to return exactly that challenge value. In other cases, extra parameters may sit directly in the page or be generated dynamically through encryption, and you have to analyze them yourself. Here, we only need to work out how password is encrypted (readers with a feel for data may already have guessed the algorithm).

 

Once you know which parameter needs to be worked out, there are generally two ways in. The first is to locate the login button in the Elements panel, then search globally to find where it is used in the JS and debug from there. This front-to-back approach sounds natural, but in practice it rarely leads anywhere useful, because you cannot tell at which point after you fill in the form the parameters get encrypted. I do not recommend it.


The second is to search directly for the parameter names. There are three: username, password, and challenge. Here I searched for password to see where it is used; you can also search for related words such as login. Eventually I found the login logic, where the password appears to be encrypted by the createChapPassword method.

 

Set a breakpoint there and click login. Execution stops at the breakpoint, and the account and password are still in plain text at this point, which means the data has not been encrypted yet. From here we step through the logic.

 


Stepping into the function shows that the core logic is here.

 

    var createChapPassword = function(password){
        var id = '';
        var challenge = '';
        var str = '';

        // Random id in the range 0..255
        id = Math.round(Math.random()*10000)%256;

        // Synchronous request for the server-issued challenge (a hex string)
        $.ajax({
            type : 'POST',
            url : globalVar.io_url + 'getchallenge',
            dataType : 'json',
            timeout : 5000,
            cache : false,
            async : false,
            success : function(resp){
                if(resp && (resp.reply_code != null) && (resp.reply_code == 0))  challenge = resp.challenge;
            }
        });

        // str = id character + password + challenge decoded two hex digits at a time
        str += String.fromCharCode(id);
        str += password;

        for(i=0;i<challenge.length;i+=2){
            var hex = challenge.substring(i,i+2);
            var dec = parseInt(hex,16);
            str += String.fromCharCode(dec);
        }

        var hash = $.md5(str);

        // Two hex digits of id followed by the MD5 hash
        chappassword = ((id<16) ? "0" : "") + id.toString(16) + hash;

        return {password : chappassword , challenge : challenge};
    };

 

Let me walk through the logic. If anything is unclear, a quick search will explain it.

 

First, an id is generated from a random number. When reproducing the logic in another language, a fixed value can be used instead.

 

Then an Ajax request fetches the challenge parameter. str starts with the Unicode character whose code is id, followed by the password. The challenge is then read two characters at a time, each pair is parsed as a hexadecimal number, and the Unicode character with that code is appended.

 

Finally, str is hashed once with MD5, and the result is assembled as the id in two hex digits followed by the hash. That is the whole parameter encryption logic; we just need to reproduce it.
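
The string-building steps described above can be sketched in Python as follows. This is only a minimal sketch of the part before any hashing; the function names build_chap_string and chap_prefix and the sample values are made up for illustration and do not come from the campus portal.

```python
def build_chap_string(password: str, challenge: str, chap_id: int) -> str:
    """Build the string that the front end passes to its MD5 function."""
    # Step 1: the character whose code point is the id (0..255).
    text = chr(chap_id)
    # Step 2: the plain-text password.
    text += password
    # Step 3: the challenge, two hex digits at a time, each pair as one character.
    for i in range(0, len(challenge), 2):
        text += chr(int(challenge[i:i + 2], 16))
    return text


def chap_prefix(chap_id: int) -> str:
    """Two lowercase hex digits of the id, zero-padded like the JavaScript."""
    return format(chap_id, "02x")


if __name__ == "__main__":
    # Made-up values: id 1, password "ab", challenge "41ff".
    sample = build_chap_string("ab", "41ff", 1)
    print([hex(ord(c)) for c in sample])  # ['0x1', '0x61', '0x62', '0x41', '0xff']
    print(chap_prefix(10), chap_prefix(170))  # 0a aa
```

Keeping the result as a character string at this point, rather than bytes, mirrors what the JavaScript does; how those characters become bytes for MD5 is a separate decision.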

 

Reproducing the logic

In practice, reproducing the logic was not that simple. I followed it step by step and everything matched the values in the browser until the MD5 step: the result computed in Python was inconsistent with the MD5 produced by the front end.

 

I spent a long time investigating this problem and wasted a lot of time on it, so let me share the process.

 


 

What is going on? Most programming languages encode a string into bytes first and then compute the MD5 of those bytes, and the most common encoding is UTF-8. Online MD5 tools gave the same result as Python.

 

Then I printed the UTF-8 encoded result, pasted that encoded string into the browser console, and hashed it there. To my surprise, the result was the same as the result I was comparing against (the hash containing 33c9).

 

This shows that jQuery's MD5 plugin does not encode characters as UTF-8 but in some other way, and we need to find that way and use it in our own language. After several attempts and searches, I finally found the encoding:

 

ISO-8859-1

I had last met this encoding long ago, when learning file download on a JavaWeb server: files with Chinese names came out garbled and the file name had to be re-encoded. I have rarely touched it since. After switching to this encoding, the result we want is finally printed:

    ª124412ðRkhìy’LŒÁZosõ
    b'\xc2\xaa124412\xc3\xb0Rkh\xc3\xacy\xc2\x92L*\x08\xc2\x8c\xc3\x81Zos\xc3\xb5'
    hash 297ad4844ee638891233c9ca65df4d9c
    chappasword aa297ad4844ee638891233c9ca65df4d9c
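
To make the encoding difference concrete, here is a small sketch comparing the two encodings on a made-up string (not the real password or challenge). It assumes, as concluded above, that the front end effectively hashes one byte per character, which is what ISO-8859-1 gives for code points up to 255.

```python
import hashlib

# Made-up sample: two characters above 0x7F are enough to show the difference.
text = chr(0xAA) + "pw" + chr(0xF0)

utf8_bytes = text.encode("utf-8")          # characters >= 0x80 become two bytes
latin1_bytes = text.encode("ISO-8859-1")   # every character stays one byte

print(utf8_bytes)    # b'\xc2\xaapw\xc3\xb0'
print(latin1_bytes)  # b'\xaapw\xf0'

# MD5 works on bytes, so different byte sequences give different digests.
utf8_hash = hashlib.md5(utf8_bytes).hexdigest()
latin1_hash = hashlib.md5(latin1_bytes).hexdigest()
print(utf8_hash == latin1_hash)  # False
```

For pure ASCII input the two encodings produce identical bytes, which is why the mismatch only shows up once the id or a challenge byte is 0x80 or above.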

 

That completes the analysis. Now wrap it up in code and try it out. I use Python here; Java would work just as well. I use a session from the requests module, which keeps cookies automatically. The code will not run as-is outside my environment, so treat it as a reference to compare against the front-end JavaScript logic above.

 

    import hashlib

    import requests

    # Request headers copied from the browser's captured login request
    # (accepted response types, browser identity, and so on).
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'm.njust.edu.cn',
    }

    # Form data sent to the back end. The values are filled in below
    # from the account, the computed password, and the challenge.
    data = {
        'username': '',
        'password': '',
        'challenge': '',
    }


    def get_challenge():
        url = 'http://m.njust.edu.cn/portal/index.html'
        # Visit the portal page first so the session picks up its cookies.
        session.get(url)
        req2 = session.post("http://m.njust.edu.cn/portal_io/getchallenge")
        challenge = req2.json()['challenge']
        return challenge


    def get_str2():
        # id character + password + challenge decoded two hex digits at a time
        str2 = chr(id)
        str2 = str2 + password
        for i in range(0, len(challenge), 2):
            hex1 = challenge[i: i + 2]
            dec = int(hex1, 16)
            str2 = str2 + chr(dec)
        return str2


    def login():
        loginurl = 'http://m.njust.edu.cn/portal_io/login'
        req3 = session.post(loginurl, data=data, headers=header)
        print(req3.text)


    if __name__ == '__main__':
        id = 162
        session = requests.session()
        # The first requests obtain the cookies and the challenge.
        challenge = get_challenge()
        username = '12010xxxxxx49'
        password = "12xxxx2"
        str2 = get_str2()

        hash = hashlib.md5(str2.encode('ISO-8859-1')).hexdigest()
        # Print the hashed password (a 32-character MD5 hex digest).
        print('hash', hash)

        # Two hex digits of id (zero-padded, as in the JavaScript) + hash.
        chappassword = format(id, '02x') + hash
        print('chappasword', chappassword)

        data['username'] = username
        data['password'] = chappassword
        data['challenge'] = challenge
        login()

 

Time to run it. I had not logged in to the network manually, and the network came up, so it seems our result was successful.
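
To reconnect automatically after a drop, which was the original request, the login can be wrapped in a simple watchdog loop. The following is only a sketch: keep_online, is_online, and the 60-second interval are hypothetical names and values of mine, and how you detect that the link is down (for example, by requesting an external page with a short timeout) depends on your campus network.

```python
import time
from typing import Callable, Optional


def keep_online(is_online: Callable[[], bool],
                login: Callable[[], None],
                interval: float = 60.0,
                max_checks: Optional[int] = None,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Call login() whenever is_online() reports False; return the login count."""
    logins = 0
    checks = 0
    # max_checks=None means run forever; a number is handy for testing.
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            if not is_online():
                login()
                logins += 1
        except Exception as exc:
            # Keep the loop alive if the check or the login itself fails.
            print("check or login failed:", exc)
        sleep(interval)
    return logins


if __name__ == "__main__":
    # Simulated run: the link is reported down on the second of three checks.
    states = iter([True, False, True])
    count = keep_online(lambda: next(states),
                        lambda: print("logging in"),
                        interval=0, max_checks=3)
    print(count)  # 1
```

In real use, login would fetch a fresh challenge, recompute the password, and post the form as in the script above, and max_checks would be left as None so the loop runs indefinitely.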


Successful run

 

Summary

This problem is not complicated for veterans, but it may be novel and interesting for many readers. Of course, these days crawlers are best kept to personal tinkering: commercial use or large-scale crawling of private data can get you into legal trouble. When I get the chance, I will share some pages that use the traditional login method.

 

This small encryption analysis was simple to reproduce, yet I was stuck for a long time on the encoding problem. In the end, my foundations were relatively weak, and a lot of time was wasted. With a solid grasp of these hashing algorithms and the low-level basics, you could simply guess what algorithm is in use and what encoding the data is in. Fortunately, this demo helped fill in that blind spot.