What are CDNs? Is using a CDN necessarily faster than not using it?
For developers, the term CDN is both familiar and unfamiliar.
We rarely touch it in day-to-day development, yet we hear others mention it all the time.
We have all heard that it speeds things up, and we probably know roughly why. But then someone asks a deeper question:
Is using a CDN necessarily faster than not using one?
That is where it gets confusing. No matter: today we will take a fresh look at CDNs from another angle.
What is a CDN
Numeric and text data, such as names and phone numbers, needs somewhere to be stored.
We usually keep it in a MySQL database.
Text data is stored in MySQL
When we need the data again, we read it back from MySQL.
Because MySQL keeps its data on disk, a single instance handles roughly 5k read QPS, which is already quite good.
That sounds fine, but for a slightly larger system it starts to feel tight.
To improve performance, we put an in-memory cache layer, such as the often-mentioned Redis, in front of MySQL. Reads go to memory first and fall through to MySQL only on a miss, which greatly reduces the number of MySQL reads. With this combination, read throughput can easily reach tens of thousands of QPS.
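The read path described above is often called cache-aside. Below is a minimal sketch of it, assuming two plain dictionaries stand in for Redis and MySQL; the names `cache`, `database`, `read`, and the sample key are hypothetical and only illustrate the order of the lookups.

```python
from typing import Optional

# Hypothetical stand-ins: plain dicts play the roles of Redis and MySQL.
cache: dict[str, str] = {}
database: dict[str, str] = {"user:1:phone": "555-0100"}
db_reads = 0  # counts how often a read falls through to the "database"


def read(key: str) -> Optional[str]:
    """Read from the cache first; on a miss, read the database and fill the cache."""
    global db_reads
    if key in cache:
        return cache[key]
    db_reads += 1
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value


first = read("user:1:phone")   # miss: goes to the database, then fills the cache
second = read("user:1:phone")  # hit: served from the cache
```

After the two calls, `db_reads` is 1: only the first read reached the slower layer.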
MySQL and Redis
So far, these are development scenarios most of us run into regularly.
But what if the data I need to handle is no longer text like the above, but images?
For example, say I have a handsome photo, the one below.
Every time I scroll through a certain short-video app and hear someone cover Cai Tanya's "Letting Go", I can't help wanting to post this picture,
captioned "I still can't forget".
So here comes the question.
Where should this image data be stored, and where should I read it from?
Let's look back at the MySQL and Redis scenario, which is nothing more than a storage layer plus a cache layer.
Storage layer and cache layer
For file objects such as pictures, MySQL is a poor fit for the storage layer. A dedicated object storage service should be used instead, such as Amazon S3 (Amazon Simple Storage Service; three words starting with S, hence S3) or Alibaba Cloud OSS (Object Storage Service). The rest of this article uses OSS as the example.
As for the cache layer, Redis no longer fits either; it is replaced by a CDN (Content Delivery Network).
A CDN can be understood simply as the cache layer for object storage.
CDN and OSS
Now we can answer the question above. The image data is stored in object storage and, when a user needs it, it is read through the CDN.
How CDNs Work
Now that we have CDN and object storage, let's see how they work together.
For a picture on a web page, we can right-click it and copy its URL.
You will find that the URL of the picture looks like this.
https://cdn.xiaobaidebug.top/1667106197000.png
Here, cdn.xiaobaidebug.top at the front is the CDN domain name, and 1667106197000.png at the end is the path of the image.
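As a small illustration, the same split can be done with Python's standard `urllib.parse` module. The variable names are only for this sketch.

```python
from urllib.parse import urlsplit

url = "https://cdn.xiaobaidebug.top/1667106197000.png"
parts = urlsplit(url)

cdn_domain = parts.hostname  # the part that DNS has to resolve
image_path = parts.path      # the part the CDN node uses to locate the file
```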
When we enter this URL in the browser, it sends an HTTP GET request, which goes through the following process.
CDN query process
Stage one: your computer first obtains the IP address for the domain name cdn.xiaobaidebug.top through the DNS protocol.
• Step 1 and step 2: The browser cache is checked first, then the operating system's /etc/hosts file. If neither has an entry, the query goes to the nearest DNS server (for example, your home router), which returns the answer if it has one cached.
• Step 3: If the nearest DNS server has no cached answer, it queries the root, first-level, second-level, and third-level domain name servers in turn.
• Step 4: The nearest DNS server then obtains the alias (CNAME) of cdn.xiaobaidebug.top, for example cdn.xiaobaidebug.top.w.kunlunaq.com.
• kunlunaq.com is the dedicated DNS scheduling system of Alibaba Cloud CDN.
• Steps 5 to 7: The nearest DNS server then queries kunlunaq.com, which returns the IP address "nearest" to you.
Stage two, corresponding to step 8 in the figure above: the browser uses this IP to access the CDN node, and the CDN node returns the data.
Stage one introduced several terms, such as CNAME, root domain, and first-level domain. They are described in detail in the earlier article "What are the excellent designs worth learning in DNS"; take a look if anything is unclear.
We know that the purpose of DNS is to obtain an IP address from a domain name.
But that is only one of its functions.
DNS has many record types. An A record maps a domain name to its IP address; a CNAME record maps a domain name to an alias, which is another domain name.
For an ordinary domain name, DNS resolution returns the IP address directly (an A record, where A stands for Address).
For example, below I use the dig command to make a DNS request and print the resolution process.
$ dig +trace xiaobaidebug.top
;; ANSWER SECTION:
xiaobaidebug.top. 600 IN A 47.102.221.141
You can see that xiaobaidebug.top resolves directly to the IP address 47.102.221.141.
For the CDN domain name, however, the first thing the query returns is a CNAME record, xx.kunlunaq.com; running dig on that xx.kunlunaq.com name then returns the IP addresses.
$ dig +trace cdn.xiaobaidebug.top
cdn.xiaobaidebug.top. 600 IN CNAME cdn.xiaobaidebug.top.w.kunlunaq.com.
$ dig +trace cdn.xiaobaidebug.top.w.kunlunaq.com
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.243
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.241
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.244
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.249
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.248
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.242
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.250
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.251
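The two-step lookup shown in the dig output above (CNAME first, then A) can be sketched as a loop over a toy record table. This is only an illustration: `RECORDS` and `resolve` are hypothetical names, the table holds just two of the A records from the output, and a real resolver talks to name servers rather than a dictionary.

```python
# Toy record table copied from the dig output above (only two of the A records).
RECORDS = {
    ("cdn.xiaobaidebug.top.", "CNAME"): ["cdn.xiaobaidebug.top.w.kunlunaq.com."],
    ("cdn.xiaobaidebug.top.w.kunlunaq.com.", "A"): ["122.228.7.243", "122.228.7.241"],
    ("xiaobaidebug.top.", "A"): ["47.102.221.141"],
}


def resolve(name: str, max_hops: int = 8) -> list[str]:
    """Follow CNAME records until an A record is found, then return its IPs."""
    for _ in range(max_hops):
        if (name, "A") in RECORDS:
            return RECORDS[(name, "A")]
        aliases = RECORDS.get((name, "CNAME"))
        if not aliases:
            return []  # no A record and no alias: nothing to resolve
        name = aliases[0]  # continue the lookup with the alias
    return []  # give up on overly long alias chains
```

Calling `resolve("xiaobaidebug.top.")` finds the A record immediately, while `resolve("cdn.xiaobaidebug.top.")` takes one extra hop through the alias.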
This raises another question.
Why go to the trouble of adding a CNAME?
The CNAME actually points to a DNS name server dedicated to the CDN. It is just one of the many name servers in the DNS system and looks as unremarkable as any other; DNS requests reach it in the normal way.
Its special role shows only when a request actually hits it. An ordinary DNS name server simply returns the IP address for the domain name, whereas the CDN's dedicated name server returns the IP of the server "nearest" to the caller.
The CDN's dedicated DNS server returns the IP of the nearest CDN node
How does it know which server IP is nearest to the caller?
Notice that the word "nearest" is in quotation marks.
The CDN's dedicated DNS name server is provided by the CDN provider. Alibaba Cloud, for example, certainly knows which CDN nodes it has, along with their current load, response latency, and even weight. It also sees the caller's IP address, from which it can infer the caller's network operator and approximate location. Based on these conditions it filters out the most suitable CDN server. This is the so-called "nearest".
For example, suppose the closest CDN data center is handling heavy traffic and responding slowly, while a server farther away can serve the current request better. In that case the farther CDN server may be selected.
In other words, the selected server is not necessarily the geographically closest one, but it should be the most suitable one at the moment.
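To make "most suitable rather than closest" concrete, here is a toy scoring sketch. Everything in it is an assumption made for illustration: the `Node` fields, the weights in `score`, and the sample IPs are invented, and a real CDN scheduler uses its own, far richer logic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ip: str
    operator: str        # network operator the node sits in
    distance_km: float   # rough distance to the caller
    load: float          # 0.0 (idle) to 1.0 (saturated)


def score(node: Node, caller_operator: str) -> float:
    """Lower is better. The weights are made up for illustration."""
    cross_operator_penalty = 0.0 if node.operator == caller_operator else 500.0
    return node.distance_km + cross_operator_penalty + 2000.0 * node.load


def pick_node(nodes: list[Node], caller_operator: str) -> Node:
    """Return the node with the lowest score for this caller."""
    return min(nodes, key=lambda n: score(n, caller_operator))


nodes = [
    Node("10.0.0.1", "telecom", distance_km=20, load=0.95),   # close but busy
    Node("10.0.0.2", "telecom", distance_km=300, load=0.10),  # farther but idle
]
chosen = pick_node(nodes, caller_operator="telecom")
```

With these made-up weights, the busy nearby node scores worse than the idle one farther away, so `chosen` is the node at 10.0.0.2.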
What is back-to-origin
The image URL above has the form https://cdn domain name/image address.png.
In other words, the picture is fetched through the CDN.
So can we get the image data and display it by accessing object storage directly?
For example, like this:
https://oss domain name/image address.png
This is like asking whether, without Redis, you can read text data directly from MySQL and display it.
Of course you can.
That is what I did with the pictures on my blog earlier.
But it costs more, where cost can mean performance cost or billing cost. Take a look at the picture below.
You can see that requesting OSS directly costs almost twice as much as requesting OSS through the CDN. Since I am on a tight budget, and also to make the blog's pictures load faster, I switched to the CDN.
But this raises yet another question.
In the screenshot above, the red box contains the term "back-to-origin" (also translated as "return to the source").
What is back-to-origin?
When we visit https://cdn domain name/image address.png, the request hits the CDN server.
But the CDN server is essentially a cache layer, not the data source; object storage is the data source.
The first time a given picture is requested through the CDN, the CDN most likely does not have it, so the CDN has to go back to the data source to fetch the picture and then store it in its own cache. On later requests, as long as the cached copy has not expired, the CDN hits the cache and returns directly without going back to the origin.
So the access process becomes as follows.
In what other situations does back-to-origin happen?
Besides the CDN not having the data at all, a cached copy on the CDN that has expired also causes a trip back to the origin.
In addition, even when a cached copy exists and has not expired, back-to-origin can be triggered actively through the open API the CDN provides, although we rarely get to use this.
Users cannot perceive back-to-origin, because when reading a picture they only know whether the read succeeded.
A successful read can be further divided into two cases: the data came straight from the CDN cache, or the CDN went back to the origin, read object storage, and then returned it.
The difference between returning directly from cache and going back to the origin on a cache miss
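The two cases, together with the expiry rule, can be sketched as a tiny cache with a time-to-live. This is a simplified model under stated assumptions: the `ORIGIN` dictionary stands in for object storage, `CdnNode` is a hypothetical class, and the "HIT"/"MISS" labels are just return values of this sketch, not any provider's format.

```python
import time
from typing import Optional

# Hypothetical origin: a dict stands in for object storage.
ORIGIN = {"/image.png": b"png-bytes"}


class CdnNode:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.cache: dict[str, tuple[bytes, float]] = {}  # path -> (data, expiry time)
        self.origin_fetches = 0

    def get(self, path: str, now: Optional[float] = None) -> tuple[bytes, str]:
        """Return (data, status); `now` can be passed in to make the example deterministic."""
        now = time.monotonic() if now is None else now
        entry = self.cache.get(path)
        if entry is not None and entry[1] > now:
            return entry[0], "HIT"
        # Missing or expired: go back to the origin and refill the cache.
        self.origin_fetches += 1
        data = ORIGIN[path]
        self.cache[path] = (data, now + self.ttl)
        return data, "MISS"


node = CdnNode(ttl_seconds=60)
statuses = [
    node.get("/image.png", now=0)[1],    # first request: nothing cached yet
    node.get("/image.png", now=10)[1],   # within the TTL: served from cache
    node.get("/image.png", now=100)[1],  # after expiry: back to the origin again
]
```

In all three calls the caller receives the same bytes; only the status and the `origin_fetches` counter reveal which path was taken.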
So is there a way to tell whether back-to-origin occurred?
There is. Read on.
How to judge whether back-to-origin occurs
Let's take the object storage and CDN of a certain cloud provider as an example.
Suppose I want to request the following picture: https://cdn.xiaobaidebug.top/image/image-20220404094549469.png
To view the HTTP headers of the response more conveniently, we can use Postman.
Request the image with the GET method.
Then switch to the tab below to view the response headers.
Viewing the response headers
Back-to-origin
Here the X-Cache response header has the value MISS TCP_MISS. This means the cache missed, so the CDN went back to the origin, read OSS, and returned the data it fetched.
The CDN should now have this picture cached, so we can send the GET request again.
The value of X-Cache becomes HIT TCP_MEM_HIT, which means a cache hit.
This is how Alibaba Cloud does it. Other providers, such as Tencent Cloud, work in much the same way, and you can usually find the relevant information in the response headers.
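The same check can be scripted instead of done by hand. The sketch below assumes the X-Cache value starts with HIT or MISS, as in the two values seen above; other providers may use a different header or format. `is_cache_hit` and `fetch_x_cache` are hypothetical helper names, and `fetch_x_cache` makes a real network request when called.

```python
from urllib.request import Request, urlopen


def is_cache_hit(x_cache: str) -> bool:
    """Return True when an X-Cache value such as 'HIT TCP_MEM_HIT' reports a hit."""
    tokens = x_cache.strip().upper().split()
    return bool(tokens) and tokens[0] == "HIT"


def fetch_x_cache(url: str) -> str:
    """Send a GET request and return the X-Cache response header ('' if absent)."""
    with urlopen(Request(url, method="GET")) as response:
        return response.headers.get("X-Cache", "")
```

For example, `is_cache_hit(fetch_x_cache(url))` reports whether a given request was served from the CDN cache, provided the response carries the header in this form.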
Is it faster to use a CDN than not to use it?
At this point we can answer the question from the beginning of the article.
Without a CDN, requests go directly to the origin, and the process is as follows.
Direct access to the origin
With a CDN, however, if the CDN has no cached copy of the data, back-to-origin is triggered.
With a CDN, a cache miss goes back to the origin
This is equivalent to adding a CDN hop to the original process.
That is, when a CDN is used and a cache miss leads to back-to-origin, the request is slower than it would be without the CDN.
A cache miss may happen because the CDN never had the data, or because the data was once cached but has since expired.
Both situations are normal and most of the time need no treatment.
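A rough back-of-the-envelope model makes the trade-off visible. It assumes every request through the CDN pays the cost of reaching the CDN node, and each miss additionally pays the cost of back-to-origin. The function name and all latency numbers below are invented for illustration, not measurements.

```python
def expected_cdn_latency(hit_ratio: float, edge_ms: float, back_to_origin_ms: float) -> float:
    """Average latency via the CDN: every request pays edge_ms, misses also pay back_to_origin_ms."""
    return edge_ms + (1.0 - hit_ratio) * back_to_origin_ms


# Made-up numbers, for illustration only.
direct_ms = 80.0          # requesting the origin directly
edge_ms = 20.0            # reaching the CDN node
back_to_origin_ms = 80.0  # extra time when the CDN has to fetch from the origin

all_miss = expected_cdn_latency(0.0, edge_ms, back_to_origin_ms)
mostly_hit = expected_cdn_latency(0.9, edge_ms, back_to_origin_ms)
```

Under these assumptions, a request that misses costs more than going direct, while a high hit ratio brings the average well below the direct path.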
In rare scenarios, though, some optimization may be needed. Suppose the origin data goes through a major version update, for example a change of CDN domain name. The moment the change goes live, all users request pictures through the new CDN domain name, and the new CDN nodes go back to the origin for nearly 100% of requests. In severe cases this can even overwhelm the object storage. To prepare, you can identify the hot data in advance and use a tool to request it once beforehand so that the CDN loads it into its cache. The CDN of a certain cloud provider, for example, offers a "refresh warm-up" feature for this.
CDN refresh warm-up
Alternatively, a grayscale (canary) release lets a small number of users try the new version first; these users warm up the CDN, and traffic is then opened up gradually.
As for data that was cached but later expired: for hot data, you can increase the CDN cache time appropriately.
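The "pre-request a wave" idea can be sketched as a loop over the hot paths. This is not the cloud provider's warm-up feature; it is a hand-rolled illustration in which `warm_up` takes a hypothetical `fetch` callable that requests a path through the new CDN domain and returns the cache status. A fake `fetch` is used here so the example runs without a network.

```python
from typing import Callable, Iterable


def warm_up(hot_paths: Iterable[str], fetch: Callable[[str], str]) -> dict[str, str]:
    """Request each hot path once so the CDN caches it before real traffic arrives."""
    return {path: fetch(path) for path in hot_paths}


# Fake fetch for illustration: the first request for a path misses, later ones hit.
_seen: set[str] = set()


def fake_fetch(path: str) -> str:
    status = "HIT" if path in _seen else "MISS"
    _seen.add(path)
    return status


hot = ["/a.png", "/b.png"]
first_pass = warm_up(hot, fake_fetch)   # pays the back-to-origin cost up front
second_pass = warm_up(hot, fake_fetch)  # what real users would see afterwards
```

The point of the design is simply to move the misses from launch time to a moment you control.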
When should you not use a CDN?
From the description above, the biggest advantage of a CDN is that it serves users all over the world from nearby CDN nodes, and that repeated requests for the same file are accelerated by the cache.
This is ideal for scenarios such as web page images. Because the underlying layer is object storage, any file object, video for example, can be accelerated through the CDN in the same way. The short videos on the popular short-video apps are served like this.
Now turn the question around.
When should you not use a CDN?
If you have a company intranet service, and the pictures and other files it requests are unlikely to be requested repeatedly, there is no real need for a CDN.
Note the two key points in that sentence: an intranet service, and files that are unlikely to be requested repeatedly.
- Being an intranet service means you know where the requests come from and the service can be granted read permission on the object storage. If the object storage is also inside the company, it is very likely in the same data center as the service, which is already close. Going through a CDN then gains nothing from "serving from a nearby CDN node".
- The pictures or other files are unlikely to be requested more than once. If you go through a CDN, then almost every request finds that the CDN node does not have the data, so it has to go back to the origin and fetch it from object storage each time. The CDN then amounts to an extra proxy layer, and every extra proxy layer costs time.
If you want a clear metric for the second point, here is one. As described above, the X-Cache field in the HTTP response headers from the CDN shows whether a request triggered back-to-origin. Count those requests and divide by the total number of requests to get the back-to-origin ratio. If the ratio is as high as 90%, why connect to a CDN at all?
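The ratio is simple to compute once the X-Cache values have been collected, for example from access logs. The sketch below assumes a value that does not start with HIT means back-to-origin occurred, matching the two values seen earlier; `back_to_origin_ratio` and the sample data are hypothetical.

```python
def back_to_origin_ratio(x_cache_values: list[str]) -> float:
    """Share of responses whose X-Cache value does not start with HIT."""
    if not x_cache_values:
        return 0.0
    misses = sum(1 for value in x_cache_values if not value.upper().startswith("HIT"))
    return misses / len(x_cache_values)


# Invented sample: nine misses and one hit.
sample = ["MISS TCP_MISS"] * 9 + ["HIT TCP_MEM_HIT"]
ratio = back_to_origin_ratio(sample)
```

For this sample the ratio is 0.9, the kind of number that suggests the CDN is acting mostly as an extra hop.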
Summary
- For text data, we usually use MySQL for storage and Redis for caching. For file data such as videos and images, we use object storage such as OSS, with a CDN for caching.
- If a request through the CDN triggers back-to-origin, it is actually slower than a request made without the CDN.
- The biggest advantage of a CDN is that it serves users all over the world from nearby CDN nodes and accelerates repeated requests for the same file through caching. If your service and object storage are both on the intranet and the files are unlikely to be requested again, there is no real need for a CDN.