Google data centers are very different from most conventional data centers and small-scale server farms. These differences present both problems and additional opportunities. This section discusses the challenges and opportunities that characterize Google’s data centers and introduces terminology used throughout the book.
Most of Google's computing resources are located in Google-designed data centers with proprietary power distribution, cooling, networking, and computing hardware (see [Bar13]). Unlike "standard" colocation data centers, the computing hardware in a Google-designed data center is the same across the board. To avoid confusion between server hardware and server software, we use the following terms throughout the book:
Machine
    A piece of hardware (or perhaps a VM)
Server
    A piece of software that implements a service
Machines can run any server, so we don't dedicate specific machines to specific server programs. There's no specific machine that runs our mail server, for example. Instead, resource allocation is handled by our cluster operating system, Borg.
We realize this use of the word server is unusual. The common use of the word conflates a "binary that accepts network connections" with a machine, but differentiating between the two is important when talking about computing at Google. Once you get used to our use of server, it becomes more apparent why it makes sense to use this specialized terminology, not just within Google but also throughout the rest of this book.
Figure 2-1 illustrates the topology of a Google data center:
Tens of machines are placed in a rack.
Racks stand in a row.
One or more rows form a cluster.
Typically, a data center building contains multiple clusters.
Multiple data center buildings located close together form a campus.
Figure 2-1. Example of Google data center campus topology
Machines within a given data center need to be able to talk to each other, so we created a very fast virtual switch with tens of thousands of ports. We accomplished this by connecting hundreds of Google-built switches in a Clos network [Clos53] called Jupiter [Sin15]. In its largest configuration, Jupiter supports 1.3 Pbps of bisection bandwidth among servers.
Data centers are connected to each other with our globe-spanning backbone network B4 [Jai13]. B4 is a software-defined networking architecture (and uses the OpenFlow open-standard communications protocol). It supplies massive bandwidth to a modest number of sites and uses elastic bandwidth allocation to maximize average bandwidth [Kum15].
System software that “organizes” the hardware
Our hardware must be controlled and administered by software that can handle massive scale. Hardware failures are one notable problem that we manage with software. Given the large number of hardware components in a cluster, hardware failures occur quite frequently. In a single cluster in a typical year, thousands of machines fail and thousands of hard disks break; when multiplied by the number of clusters we operate globally, these numbers become somewhat breathtaking. Therefore, we want to abstract such problems away from users, and the teams running our services similarly don't want to be bothered by hardware failures. Each data center campus has teams dedicated to maintaining the hardware and data center infrastructure.
Borg, illustrated in Figure 2-2, is a distributed cluster operating system [Ver15], similar to Apache Mesos. Borg manages its jobs at the cluster level.
Figure 2-2. High-level Borg cluster architecture
Borg is responsible for running users' jobs, which can either be indefinitely running servers or batch processes like a MapReduce [Dea04]. Jobs can consist of more than one (and sometimes thousands of) identical tasks, both for reasons of reliability and because a single process can't usually handle all cluster traffic. When Borg starts a job, it finds machines for the tasks and tells the machines to start the server program. Borg then continually monitors these tasks. If a task malfunctions, it is killed and restarted, possibly on a different machine.
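The monitor-and-restart behavior described above can be sketched as a simple control loop. This is an illustrative toy, not Borg's actual implementation; the names `Task`, `reschedule`, and `monitor` are assumptions made for the example.

```python
import random

class Task:
    """One of a job's identical tasks, currently placed on some machine."""
    def __init__(self, index, machine):
        self.index = index        # task index within the job
        self.machine = machine    # machine currently hosting the task
        self.healthy = True

def reschedule(task, machines):
    """Restart a failed task, possibly on a different machine."""
    task.machine = random.choice(machines)
    task.healthy = True
    return task

def monitor(tasks, machines):
    """One pass of the control loop: restart every malfunctioning task."""
    for task in tasks:
        if not task.healthy:
            reschedule(task, machines)
    return tasks

machines = ["m1", "m2", "m3"]
tasks = [Task(i, machines[i % len(machines)]) for i in range(4)]
tasks[2].healthy = False          # simulate a task malfunction
monitor(tasks, machines)
assert all(t.healthy for t in tasks)   # the job is whole again
```

The key point the sketch captures is that the job, not the individual machine, is the unit the system keeps alive: a task's host machine is a detail that can change on every restart.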
Because tasks are fluidly allocated over machines, we can't simply rely on IP addresses and port numbers to refer to the tasks. We solve this problem with an additional level of indirection: when starting a job, Borg allocates a name and index number to each task using the Borg Naming Service (BNS). Rather than using the IP address and port number, other processes connect to Borg tasks via the BNS name, which is translated to an IP address and port number by BNS. For example, the BNS path might be a string such as /bns/<cluster>/<user>/<job name>/<task number>, which would resolve to <IP address>:<port>.
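This indirection can be illustrated with a toy lookup table (not the real BNS): clients hold a stable BNS-style path, and only the mapping behind it changes when a task moves. The path, addresses, and `resolve` helper here are invented for the example.

```python
# Hypothetical name-to-address table; the real BNS is a distributed service.
bns = {
    "/bns/cluster-a/alice/webserver/0": ("10.0.0.5", 8080),
}

def resolve(path):
    """Translate a BNS-style path into '<IP address>:<port>'."""
    ip, port = bns[path]
    return f"{ip}:{port}"

assert resolve("/bns/cluster-a/alice/webserver/0") == "10.0.0.5:8080"

# After the task is restarted on another machine, only the mapping is
# updated; the name that clients use stays the same.
bns["/bns/cluster-a/alice/webserver/0"] = ("10.0.0.9", 8123)
assert resolve("/bns/cluster-a/alice/webserver/0") == "10.0.0.9:8123"
```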
Borg is also responsible for the allocation of resources to jobs. Every job needs to specify its required resources (e.g., 3 CPU cores, 2 GiB of RAM). Using the list of requirements for all jobs, Borg can bin-pack the tasks over the machines in an optimal way that also accounts for failure domains (for example, Borg won't run all of a job's tasks on the same rack, as doing so would make the top-of-rack switch a single point of failure for that job).
If a task tries to use more resources than it requested, Borg kills the task and restarts it (as a slowly crash-looping task is usually preferable to a task that hasn't been restarted at all).
Tasks can use the local disk on machines as a scratch pad, but we have several cluster storage options for permanent storage (and even scratch space will eventually move to the cluster storage model). These are comparable to Lustre and the Hadoop Distributed File System (HDFS), both of which are open source cluster filesystems.
The storage layer is responsible for offering users easy and reliable access to the storage available for a cluster. As shown in Figure 2-3, storage has several layers:
The lowest layer is called D (for disk, though D uses both spinning disks and flash storage). D is a fileserver running on almost every machine