11 reasons why YouTube supports 100 million video views per day with just 9 engineers



Initially, their financial resources were limited and they could only finance YouTube through credit card debt and infrastructure borrowing. But financial constraints also forced them to create an excellent scalability technology. In the second year, the daily video views on their platform reached 100 million. What's even more surprising is that they did it with only 9 engineers.

Author | NK

Planning | Comments 

February 2005, California, USA. PayPal, the world-renowned online payment company, had been around for 6 years and 2 months. As if they had discovered the secret to attracting traffic on the Internet, three of its early employees began looking for an opportunity of their own.

Ultimately, they decided to build a platform for sharing videos. That platform, born in a garage, later became the famous YouTube.

Initially, their financial resources were limited and they could only finance YouTube through credit card debt and infrastructure borrowing. But financial constraints also forced them to create an excellent scalability technology.

In the second year, the daily video views on their platform reached 100 million. What's even more surprising is that they did it with only 9 engineers. 

How did YouTube do it? Below, we go through the key design points from that period one by one. (P.S.: At first glance, they look simple and unpretentious.)

1. Magic flywheel

They adopted a "flywheel" approach: collect and analyze system data, then use it to drive scalability work. Their workflow was a continuous cycle: identify bottlenecks → fix bottlenecks → drink water → sleep → repeat. The advantage of this method is that it avoids the need for high-end hardware and so reduces hardware costs.

Figure: Scalability loop

2. A technology stack that seems boring but is efficient

They kept their technology stack simple and used proven technologies. The stack is plainer than you might imagine:

Figure: YouTube technology stack

  • MySQL stores metadata: video titles, tags, descriptions, and user data. They chose it because problems in MySQL are easy to fix.
  • Lighttpd is the web server that serves video.
  • Linux is the operating system. They inspect system behavior with Linux tools such as strace, ssh, rsync, vmstat, and tcpdump.
  • Python runs on the application servers. It offers many reusable libraries, so they don't have to reinvent the wheel. In other words, Python allows fast and flexible development. According to their measurements, Python was never the bottleneck. Notably, they used a Python-to-C compiler and C extensions for CPU-intensive tasks.

3. Keep it simple

They believed that software architecture is the root of scalability, and they didn't blindly chase buzzwords in order to scale. Instead, they kept the architecture simple, which made code review easier and let them re-architect quickly as needs changed. For example, they pivoted from a dating site to a video-sharing site.

They also kept network paths simple, because network equipment has scalability limits.

Figure: Hardware cost

They also used commodity hardware, which reduced power consumption and maintenance expenses and kept costs low.

Additionally, they kept scale-aware code relatively independent of application development.

4. Choose your main battlefield

They outsourced many less important problems so that they could focus on what mattered. They didn't have the time or resources to build their own infrastructure for serving popular videos, so they put the popular videos on a third-party CDN. The benefits:

  • Low latency, because the content is only a few network hops from the user;
  • High performance, because videos are served from memory;
  • High availability, because of automatic replication.

They served less popular videos from a co-located data center. They used software RAID to improve performance through parallel access to multiple disks, and tuned their servers to prevent cache thrashing.

They kept their infrastructure in a co-located data center for two reasons. First, the servers could easily be tuned to suit their needs. Second, they could negotiate their own contracts.

Figure: Choose your battlegrounds; outsource problems to free up resources

Each video has 4 thumbnails, so they ran into the usual problems of serving many small objects: a large number of disk seeks and file system limitations. They therefore moved the thumbnails into BigTable, a distributed data store with several advantages: it avoids the small-file problem by clustering files together, improves performance, offers low latency through multi-level caching, and is easy to configure.

They also approximated data to avoid expensive database transactions. For example, they showed an approximate video view count and updated the real counter asynchronously. A popular technique for approximate correctness today is the Bloom filter, which is a probabilistic data structure.
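The asynchronous counter idea can be sketched as follows. This is only a minimal illustration, not YouTube's actual code: the class name `BufferedViewCounter`, the plain `dict` standing in for the database, and the explicit `flush()` call (which a background job would normally trigger on a timer) are all assumptions made for the example.

```python
import threading
from collections import Counter


class BufferedViewCounter:
    """Counts views in memory and writes them to the store in batches."""

    def __init__(self, store):
        self._store = store        # hypothetical durable store: video_id -> count
        self._pending = Counter()  # views not yet written to the store
        self._lock = threading.Lock()

    def record_view(self, video_id):
        # Cheap in-memory increment; no database transaction per view.
        with self._lock:
            self._pending[video_id] += 1

    def flush(self):
        # Swap out the pending batch under the lock, then apply it to the store.
        with self._lock:
            batch, self._pending = self._pending, Counter()
        for video_id, delta in batch.items():
            self._store[video_id] = self._store.get(video_id, 0) + delta
        return len(batch)  # number of distinct videos updated


store = {}
counter = BufferedViewCounter(store)
for _ in range(1000):
    counter.record_view("video-a")
counter.record_view("video-b")

stale_count = store.get("video-a", 0)  # the stored count lags until the next flush
counter.flush()
```

Between flushes the stored count is slightly out of date, which is the trade-off: one batched write replaces a thousand individual transactions.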

5. Three pillars of scalability

YouTube relies on three pillars of scalability: statelessness, replication, and partitioning.

Figure: 3 pillars of scalability

They kept web servers stateless and scaled them out through replication.

They replicated database servers for read scalability and high availability, and load-balanced traffic between the replicas. But this approach caused problems of its own: replication lag and limited write scalability.
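To make the two problems concrete, here is a toy model of a replicated database. It is a sketch under stated assumptions: the class `ReplicatedDatabase`, its in-memory dictionaries, and the manual `replicate()` step are hypothetical and stand in for a real primary, its read replicas, and the replication stream.

```python
import itertools


class ReplicatedDatabase:
    """Toy model: one primary takes writes, replicas serve reads round-robin."""

    def __init__(self, replica_count):
        self.primary = {}
        self.replicas = [dict() for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))
        self._log = []  # writes not yet applied to the replicas

    def write(self, key, value):
        # Every write goes through the single primary (the write bottleneck).
        self.primary[key] = value
        self._log.append((key, value))

    def read(self, key):
        # Reads are load-balanced across replicas and may see stale data.
        replica = self.replicas[next(self._next_replica)]
        return replica.get(key)

    def replicate(self):
        # Apply the pending log to every replica, then clear it.
        for key, value in self._log:
            for replica in self.replicas:
                replica[key] = value
        self._log.clear()


db = ReplicatedDatabase(replica_count=2)
db.write("video:1:title", "My first video")
stale = db.read("video:1:title")  # replication lag: the replica has not caught up
db.replicate()
fresh = db.read("video:1:title")
```

Adding replicas increases read capacity, but every write still passes through the one primary and must then be applied on every replica.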

Figure: Replication and Partitioning

Therefore, they partitioned the database to improve write scalability, cache locality, and performance. Partitioning also reduced hardware costs by 30%.

Additionally, they studied data access patterns to determine partitioning levels. For example, they looked at popular queries, joins, and transactional consistency, and chose user as the partitioning level.
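Partitioning by user can be illustrated with a simple hash-based router. This is a sketch only: the shard count, the function names, and the use of SHA-256 modulo the shard count are assumptions for the example and not a description of YouTube's actual scheme.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count


def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard index with a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each shard is modeled as a dict: user_id -> {video_id: title}.
shards = [dict() for _ in range(NUM_SHARDS)]


def save_video_metadata(user_id, video_id, title):
    shard = shards[shard_for_user(user_id)]
    shard.setdefault(user_id, {})[video_id] = title


def videos_of(user_id):
    # All of a user's rows live on one shard, so this touches one database.
    return shards[shard_for_user(user_id)].get(user_id, {})


save_video_metadata("alice", "v1", "Cat video")
save_video_metadata("alice", "v2", "Dog video")
save_video_metadata("bob", "v3", "Bird video")
```

Because everything belonging to one user lands on the same shard, per-user queries and joins stay local. A plain modulo scheme like this one does make changing the shard count expensive, since most keys would move.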

6. Solid engineering team

A knowledgeable team is a great asset to scalability.

Figure: Interdisciplinary team

They kept the team small to improve communication: just 9 engineers. The team members were strong across multiple disciplines.

7. Don’t repeat yourself

They used caching to avoid repeating expensive operations, which enabled them to scale video views.

Figure: Multi-level cache for scalability

They also implemented caching at multiple levels, which reduced latency.
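A two-level lookup shows the idea in miniature. The sketch below assumes a per-server local cache in front of a shared cache tier, with an expensive origin function behind both. The names `TwoLevelCache` and `render_video_page` are hypothetical, and plain dictionaries stand in for real cache servers.

```python
class TwoLevelCache:
    """Local per-server cache in front of a shared cache, then the origin."""

    def __init__(self, shared, load_from_origin):
        self.local = {}          # level 1: private to this app server
        self.shared = shared     # level 2: shared by all app servers
        self.load_from_origin = load_from_origin
        self.origin_calls = 0    # how often this server ran the expensive step

    def get(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.shared:
            value = self.shared[key]
        else:
            self.origin_calls += 1
            value = self.load_from_origin(key)
            self.shared[key] = value
        self.local[key] = value  # promote into the faster level
        return value


def render_video_page(video_id):
    # Stand-in for an expensive operation such as rendering or a DB query.
    return f"<page for {video_id}>"


shared_tier = {}
server_a = TwoLevelCache(shared_tier, render_video_page)
server_b = TwoLevelCache(shared_tier, render_video_page)

server_a.get("v1")         # miss on both levels: the origin is called once
server_a.get("v1")         # hit in server A's local cache
page = server_b.get("v1")  # hit in the shared tier: no origin call
```

The expensive operation runs once, and every later request is served from whichever cache level is closest.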

8. Ranking: Important traffic should be given priority

Figure: Rank important traffic; 80/20 rule (Pareto principle)

They prioritized video view traffic above all other traffic and reserved a dedicated cluster of resources for it. This gave video viewing high availability.

9. Prevent the "Thundering Herd"

If many concurrent clients query the server at the same moment, a thundering herd problem can occur and degrade performance.

Figure: The Thundering Herd problem

They used jitter to prevent the thundering herd problem. For example, they added jitter to the cache expiration of popular videos.
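Jittered expiration can be sketched in a few lines. The base TTL of 300 seconds and the 10% spread are assumed values for illustration, and `ttl_with_jitter` is a hypothetical helper name.

```python
import random

BASE_TTL_SECONDS = 300   # assumed base expiration time
JITTER_FRACTION = 0.1    # assumed spread: up to 10% either way


def ttl_with_jitter(base=BASE_TTL_SECONDS, jitter_fraction=JITTER_FRACTION, rng=random):
    """Return the base TTL shifted by a random amount within the spread."""
    spread = base * jitter_fraction
    return base + rng.uniform(-spread, spread)


# A seeded generator keeps the demonstration reproducible.
rng = random.Random(42)
ttls = [ttl_with_jitter(rng=rng) for _ in range(1000)]
```

Without jitter, entries cached at the same moment all expire at the same moment and the resulting misses hit the backend together. With jitter, the expirations are spread over a window (here 270 to 330 seconds), so the refreshes arrive gradually.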

10. Fight a protracted war

They focused on the big picture: algorithms and scalability. They used quick hacks to buy time while they built long-term solutions. For example, they stubbed out bad APIs in Python to head off short-term problems.

Figure: Adventure and reward

They tolerated defects in components. When they hit a bottleneck, they either rewrote the component or deleted it.

They traded efficiency for scalability. There are four examples:

  • They chose Python over C;
  • They maintained clear boundaries between components so they could scale horizontally, and tolerated the resulting latency;
  • They optimized the software until it was fast enough, but did not obsess over machine efficiency;
  • They served videos from server locations chosen by bandwidth availability rather than by latency.

11. Adaptive evolution

They adapted the system to meet their needs. Examples:

  • Key components use RPC instead of HTTP REST, which improves performance;
  • A customized BSON serves as the data serialization format, which provides high performance;
  • Certain parts of the application are eventually consistent for the sake of scalability. For example, user comments follow the "read your writes" consistency model;
  • They studied Python in depth to avoid common pitfalls, and also because they needed it for analysis;
  • Customized open source software;
  • Optimize database queries; 
  • Make non-critical real-time tasks asynchronous.
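The last item, moving non-critical work off the request path, can be sketched with a background worker. This is a minimal illustration that uses only the standard library. The function `handle_upload` and the "notify subscribers" task are hypothetical examples of non-critical work.

```python
import queue
import threading

tasks = queue.Queue()
results = []


def worker():
    # Runs queued tasks in the background, outside the request path.
    while True:
        task = tasks.get()
        if task is None:          # sentinel: stop the worker
            tasks.task_done()
            break
        results.append(task())
        tasks.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()


def handle_upload(video_id):
    # The critical part returns immediately; the non-critical part is queued.
    tasks.put(lambda: f"notified subscribers of {video_id}")
    return "upload accepted"


response = handle_upload("v1")
tasks.join()      # for the demonstration only: wait until the queue drains
tasks.put(None)   # shut the worker down cleanly
thread.join()
```

The caller gets its response without waiting for the notification, and a slow or failing background task no longer delays the user-facing request.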

Figure: Coding principles

They didn't waste time writing code to restrict people. Instead, they adopted good engineering practices, such as coding conventions, to improve the structure of their code.

--postscript-- 

In November 2006, Google acquired YouTube for $1.65 billion and operated it as a subsidiary. To this day, it remains the market leader in video sharing, with 5 billion video views per day.

According to Forbes, the founder of YouTube has a net worth of over $100 million. YouTube became the leader in the video search industry only 20 months after it was founded, which can fairly be called a Silicon Valley miracle.

Reference link: https://newsletter.systemdesign.one/p/youtube-scalability