Campus network keeps dropping? Use a crawler to reconnect it!
Preface
Hello everyone, I'm Dasai. It has been a long time, and I have missed you all.
Recently, because of a small need, I studied the encryption used by two login
pages and successfully worked out how the encrypted parameters are generated.
I will share that here.
Some time ago, a classmate's lab server kept dropping off the campus network,
and he asked whether there was any way to reconnect automatically after a
disconnection.
I didn't look into it then because I was busy, and I hadn't done this kind of
thing for a long time. Yesterday, when I had some free time, I studied and
analyzed it. This process may be a piece of cake for people with some
background, but if you haven't seen it before, it is worth going through once.
You may need it later, and the next time you meet something similar it will be
simple.
This topic belongs to the more advanced side of crawling: JS decryption. As
packing and anti-crawling techniques grow more varied and sophisticated, many
websites, especially those holding commercial data, are genuinely hard to deal
with.
Most websites require authentication, and many pages and operations are
accessible only after it. Logging in is the key step of authentication. If we
want to move freely around a site, most pages require the user to be logged in,
so login is the first problem many crawlers have to solve, and it is often the
most complicated and difficult part of the whole crawler. Only once login works
can the program do the rest.
There are two kinds of login: one sends the credentials in plaintext, the other
encrypts some of the parameters. The first is rare in practice, although it is
how the logins we wrote as students worked. Sending plaintext is not safe, so
most logins or requests encrypt some of their parameters. To simulate such a
login with a program, we need to understand how each parameter is produced,
then generate and send it ourselves. Of course, the hardest part of login is
actually the verification code (CAPTCHA). My ability is limited, I have not
studied it, and I will not cover it here. Most websites do not ask for a
verification code while the number of failed attempts is low, so this approach
still works in most scenarios.
The campus network only grants Internet access after a successful login over
HTTP, and the password parameter of that login is encrypted. Below is a small
analysis based on my own environment.
Analysis
With that much introduction out of the way, let's get to the point and analyze
the problem.
On a campus network, we connect through its Wi-Fi or a network cable and sit
inside the campus LAN. Traffic to the outside costs money, so you cannot reach
external services until you are authenticated and authorized. In other words,
you can access the Internet only after successfully logging in to the campus
network platform.
However, login pages work in different ways, so the first step is to observe
how the login behaves. I roughly divide logins into two types: ordinary form
login and Ajax (dynamic) login.
How do you tell the two apart?
It's simple: check whether the URL changes when you log in.
On this school's campus network login page, the URL stays the same after login,
so this is an Ajax login.
Is there a difference between the two? Not a big one. Ajax logins generally do
not need a dedicated capture tool, whereas form logins may involve several
redirects and new pages, which makes it hard to capture the relevant requests
in the browser. In that case you need tools such as Fiddler or Wireshark to
capture the packets.
First, press F12 to open the browser's developer tools, open the Network tab,
and select the XHR filter. With All selected there is too much content.
Occasionally a small part of the data is hidden under JS (normally not). Doc is
generally the main page; for an ordinary form login, look at the Doc requests.
After clicking login, you can see each request that was made. On this page
there is a login request, preceded by a getchallenge request. Click login first
to view the parameters it carries.
You can see that this request carries three parameters: username, password, and
an unknown challenge. The getchallenge request just above it turns out to
return a challenge parameter, so that is where the value comes from. If there
are other parameters, they may appear directly in the page, or they may be
generated dynamically by encryption; you have to analyze that yourself. From
the captured request we can see that only the encryption of password needs to
be worked out (readers with a feel for data may already have guessed what kind
of encryption it is).
Now that you know which parameter needs to be worked out, there are generally
two ways to start. The first is to locate the login button in the browser's
Elements panel and use global search to find where it is used in the JS, then
debug the logic from there. This approach works from front to back, but in
practice it rarely turns up anything useful, because you cannot tell at which
point after you fill in the form the parameters get encrypted, so I do not
recommend it.
The second is to search directly for the parameters. There are three: username,
password, and challenge, and any of them can be searched. Here I searched for
password to see where it is used; you can also search for words such as login.
Eventually I found the login logic, where the password appears to be encrypted
by the createChapPassword method.
We set a breakpoint here and click login. Execution stops at the breakpoint,
and the account and password are still plaintext at this point, which shows the
data has not been encrypted yet. From here we step through the logic.
Stepping into the function, I found that the core logic is here.
    var createChapPassword = function(password){
        var id = '';
        var challenge = '';
        var str = '';

        id = Math.round(Math.random()*10000)%256;

        $.ajax({
            type : 'POST',
            url : globalVar.io_url + 'getchallenge',
            dataType : 'json',
            timeout : 5000,
            cache : false,
            async : false,
            success : function(resp){
                if(resp && (resp.reply_code != null) && (resp.reply_code == 0)) challenge = resp.challenge;
            }
        });

        str += String.fromCharCode(id);
        str += password;

        for(i=0;i<challenge.length;i+=2){
            var hex = challenge.substring(i,i+2);
            var dec = parseInt(hex,16);
            str += String.fromCharCode(dec);
        }

        var hash = $.md5(str);

        chappassword = ((id<16) ? "0" : "") + id.toString(16) + hash;

        return {password : chappassword , challenge : challenge};
    };
Here is an explanation of the logic. If any part is unclear, a quick search
will help.
First, an id between 0 and 255 is generated from a random number; when
reproducing the logic in another language, a fixed value can be used.
Then an Ajax request fetches the challenge parameter. str starts with the
Unicode character whose code is id, then the password is appended, and then the
challenge is read two hexadecimal digits at a time, with each pair converted to
a number and appended as the corresponding Unicode character.
Finally, str is hashed once with MD5, and the result is the two-digit
hexadecimal id followed by the hash.
So that is the whole parameter encryption logic; we just need to reproduce it.
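The string-building step described above can be sketched in Python as a small pure function. This is only an illustration of the concatenation order (id character, password, decoded challenge); the name build_chap_string and the sample values are made up for the example, and the hashing step is left out on purpose.

```python
def build_chap_string(chap_id: int, password: str, challenge_hex: str) -> str:
    """Concatenate the id character, the password, and the decoded challenge.

    Mirrors the order used by createChapPassword: String.fromCharCode(id),
    then the plaintext password, then one character per pair of hex digits.
    """
    if not 0 <= chap_id <= 255:
        raise ValueError("chap_id must be in the range 0..255")
    parts = [chr(chap_id), password]
    # Walk the challenge two hex digits at a time.
    for i in range(0, len(challenge_hex), 2):
        parts.append(chr(int(challenge_hex[i:i + 2], 16)))
    return "".join(parts)


# Hypothetical values: id 1, password "ab", challenge "4142" (hex for "AB").
example = build_chap_string(1, "ab", "4142")
print(repr(example))  # '\x01abAB'
```

Keeping this step free of network calls makes it easy to compare its output character by character with the value of str seen at the browser breakpoint.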
Reproducing the logic
In fact, reproducing the logic was not that simple. I reproduced it step by
step, and everything before the hash matched what the browser produced, but
when I computed the MD5 in Python, the result differed from the MD5 produced by
the front end.
I spent a long time tracking this problem down and wasted a lot of time. Here
is how it went.
What was going on? Normally a programming language first encodes the string to
bytes and then computes the MD5 of those bytes. The best-known encoding is
UTF-8, and online MD5 tools gave the same result as Python.
I then printed the UTF-8 encoded form of the string and ran my encoded string
through the MD5 function in the browser console. To my surprise, this result
was the same as the reference result (the hash containing 33c9).
This shows that the MD5 library used with jQuery does not encode characters as
UTF-8 but in some other way. We need to find that encoding in order to
reproduce it in another language. After several attempts and searches, I
finally found it:
ISO-8859-1
I last ran into this encoding a long time ago while learning Java Web, when
file downloads with Chinese file names came out garbled and the names had to be
re-encoded; I have rarely touched it since. After switching to this encoding,
the result we wanted was finally printed.
    ª124412ðRkhìy’LŒÁZosõ
    b'\xc2\xaa124412\xc3\xb0Rkh\xc3\xacy\xc2\x92L*\x08\xc2\x8c\xc3\x81Zos\xc3\xb5'
    hash 297ad4844ee638891233c9ca65df4d9c
    chappasword aa297ad4844ee638891233c9ca65df4d9c
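The effect of the encoding choice can be seen in a small Python sketch. The sample string is made up for illustration (an id character of 170, a placeholder password, and one challenge byte 0xF0). ISO-8859-1 maps every code point from 0 to 255 to exactly one byte, while UTF-8 uses two bytes for code points from 128 up, so the two encodings feed different bytes into MD5.

```python
import hashlib

# Hypothetical sample: id character 170, a placeholder password, one challenge byte.
sample = chr(170) + "124412" + chr(0xF0)

utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("iso-8859-1")

# Characters vs. bytes: UTF-8 expands the two non-ASCII characters to two bytes each.
print(len(sample), len(utf8_bytes), len(latin1_bytes))  # 8 10 8

# Different input bytes give different digests.
utf8_hash = hashlib.md5(utf8_bytes).hexdigest()
latin1_hash = hashlib.md5(latin1_bytes).hexdigest()
print(utf8_hash == latin1_hash)  # False
```

This is consistent with the printed output above, where the UTF-8 form of the string begins with the two bytes \xc2\xaa for the single character ª.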
That completes the analysis. Now wrap it up in code and try it out. I use
Python here; Java would work just as well. I use the session of the requests
module (it keeps cookies automatically). The code cannot be used as-is in
another environment; treat it as a reference and compare it with the front-end
JavaScript logic shown earlier.
    import hashlib

    import requests

    # Request headers, copied from the login request captured in the browser
    # (accepted response types, browser identification, and so on).
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json, text/javascript, */*; q=0.01',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'm.njust.edu.cn',
    }

    # Form data sent to the backend. The values are filled in below from the
    # account, the encrypted password, and the challenge.
    data = {
        'username': '',
        'password': '',
        'challenge': '',
    }


    def get_challenge():
        # Visit the portal page first so the session picks up its cookies.
        url = 'http://m.njust.edu.cn/portal/index.html'
        session.get(url)
        req2 = session.post("http://m.njust.edu.cn/portal_io/getchallenge")
        return req2.json()['challenge']


    def get_str2():
        # The id character, then the plaintext password.
        str2 = chr(id) + password
        # Each pair of hex digits in the challenge becomes one character.
        for i in range(0, len(challenge), 2):
            str2 += chr(int(challenge[i:i + 2], 16))
        return str2


    def login():
        loginurl = 'http://m.njust.edu.cn/portal_io/login'
        req3 = session.post(loginurl, data=data, headers=header)
        print(req3.text)


    if __name__ == '__main__':
        id = 162  # the JS picks a random value in 0..255; a fixed one works too
        session = requests.session()
        challenge = get_challenge()  # the first requests also obtain the cookie
        username = '12010xxxxxx49'
        password = "12xxxx2"
        str2 = get_str2()

        # ISO-8859-1 gives one byte per character, matching the front-end MD5.
        hash = hashlib.md5(str2.encode('ISO-8859-1')).hexdigest()
        print('hash', hash)  # 32-character hexadecimal MD5 digest

        # Two-digit hex id (zero-padded for id < 16, as in the JS) plus the hash.
        chappassword = format(id, '02x') + hash
        print('chappasword', chappassword)

        data['username'] = username
        data['password'] = chappassword
        data['challenge'] = challenge
        login()
Time to run it. I was not logged in to the network beforehand, and after
running the script the connection came up, so the login evidently succeeded.
Successful launch
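The script above logs in once. For the original goal of reconnecting after the network drops, one possible approach is a polling loop that checks connectivity and logs in again when the check fails. The following is a minimal sketch under that assumption, not part of the original script: keep_online, is_online, and login are hypothetical names, and both the connectivity check and the login routine are passed in as functions so the loop itself stays independent of any particular site.

```python
import time
from typing import Callable, Optional


def keep_online(is_online: Callable[[], bool],
                login: Callable[[], None],
                interval_s: float = 30.0,
                max_checks: Optional[int] = None,
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll connectivity and call login() whenever the check fails.

    Runs forever when max_checks is None. Returns the number of login()
    calls that completed without raising.
    """
    relogins = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if not is_online():
            try:
                login()
                relogins += 1
            except Exception as exc:
                # Keep looping; the next check will try again.
                print("re-login failed:", exc)
        sleep(interval_s)
    return relogins


if __name__ == "__main__":
    # Dry run with fake callables: the "network" is down on checks 2 and 3.
    states = iter([True, False, False, True])
    count = keep_online(lambda: next(states),
                        lambda: print("logging in again"),
                        interval_s=0, max_checks=4)
    print("re-logins:", count)  # re-logins: 2
```

In real use, is_online could be any check that fails while the portal blocks traffic, and login could wrap the challenge, hash, and post steps of the script above; the polling interval is a matter of taste.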
Summary
This problem is not complicated for veterans, but it may be novel and
interesting for many people. Of course, in recent years crawlers are fine for
your own amusement, but commercial use or large-scale crawling of private data
may carry risk. When I get the chance, I will share some pages that use
traditional login methods.
This small encryption analysis is simple to reproduce, but I was stuck for a
long time on the encoding problem. In the end, my fundamentals are relatively
weak, and without a solid grasp of these encryption algorithms and the
low-level basics, a lot of time gets wasted. With a stronger foundation you
might guess at a glance which encryption this is and which encoding the data
uses... Fortunately, this demo has helped fill in that blind spot.