Data centers whose workloads are scheduled by artificial intelligence (AI) will be significantly more efficient than those running on hand-tuned scheduling algorithms, say researchers at MIT. They report that they have developed an automated scheduler that completes cluster jobs 20 to 30 percent faster, and up to twice as fast during periods of peak demand.

The school's job scheduler is built on a type of AI called reinforcement learning (RL), a trial-and-error machine-learning method that adapts its scheduling decisions to the actual workloads running in a specific cluster. Done right, RL could supersede today's state-of-the-art approach: hand-crafted scheduling algorithms that must be fine-tuned by humans, a process that introduces inefficiency.

"The system could enable data centers to handle the same workload at higher speeds, using fewer resources," the school says in a news article about the technology. The MIT researchers say that adapting RL to data centers could revolutionize operations.

"If you have a way of doing trial and error using machines, they can try different ways of scheduling jobs and automatically figure out which strategy is better than others," says Hongzi Mao, a student in the university's Department of Electrical Engineering and Computer Science, in the article. "Any slight improvement in utilization, even 1%, can save millions of dollars and a lot of energy."

What's wrong with today's data center algorithms

The problem with today's algorithms for running tasks across thousands of servers at the same time is that they aren't very efficient. In theory they should be, but because workloads (combinations of tasks) vary so widely, humans step in to tweak performance: a resource might need to be shared between jobs, say, or some jobs might need to finish faster than others. Humans, however, can't handle the range or scope of those adjustments; the job is simply too big.

Among the permutations that overwhelm manual scheduling: a lower node (a smaller computational task) can't start work until an upper node (a larger, more compute-hungry task) has completed its own. Allocating computational resources across such dependencies gets highly complicated, the researchers explain.

Decima, MIT's system, can process dynamic graphs, representations made up of nodes and of edges that connect the nodes and link tasks together, the school says. That hasn't been possible with RL before, because RL systems haven't been able to understand such graphs well enough at scale.

"Traditional RL systems are not accustomed to processing such dynamic graphs," MIT says.

MIT's graph-oriented AI differs from the forms of AI more commonly used with images. Robots, for example, learn to distinguish objects in different scenarios by processing images and receiving reward signals when they get it right.

Much as images are presented to robots, though, workloads are simulated for Decima until the system, guided by reward signals, improves its decisions. A special kind of baselining, a comparison against history, then helps Decima figure out which actions are good and which are bad, even when the complicated job structures slow everything down and the workload sequences supply only weak reward signals.
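A rough sense of how that kind of baselining might work, as a minimal hypothetical sketch (the function and variable names below are illustrative, not Decima's actual code): the reward a scheduling decision earns is compared against an average of what similar past steps earned, and only the difference is used to judge the decision.

```python
# Minimal sketch of reward baselining for a learned scheduler.
# All names are illustrative; this is not Decima's actual code.

from collections import defaultdict

# Running history of rewards seen at each decision step, used as the baseline.
baseline_sum = defaultdict(float)
baseline_count = defaultdict(int)

def update_baseline(step, reward):
    """Fold a newly observed reward into the historical baseline for this step."""
    baseline_sum[step] += reward
    baseline_count[step] += 1

def advantage(step, reward):
    """Score a scheduling decision by how much better (or worse) its reward
    was than the historical average for the same step."""
    if baseline_count[step] == 0:
        return 0.0  # No history yet; treat the decision as neutral.
    return reward - baseline_sum[step] / baseline_count[step]

# Example: rewards from a long, complicated workload are all negative and
# noisy, but a run that finishes jobs faster than usual still scores positive.
update_baseline(step=0, reward=-3.0)
update_baseline(step=0, reward=-5.0)
print(advantage(step=0, reward=-2.0))  # 2.0: better than the -4.0 average
```

A decision that beats the historical average gets a positive score and is reinforced; one that falls short gets a negative score, even when every raw reward signal is weak.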
That baselining is a key differentiator in the MIT system.

"Decima can find opportunities for [scheduling] optimization that are simply too onerous to realize via manual design/tuning processes," says Aditya Akella, a professor at the University of Wisconsin at Madison, in the MIT article. Akella's team has developed a number of high-performance schedulers. "Decima can go a step further," Akella says.
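For a concrete picture of the node-and-edge job graphs described earlier, here is a minimal, hypothetical sketch (the task names and structure are illustrative, not Decima's data model) of the dependency rule a scheduler has to respect: a downstream task cannot start until every task it depends on has finished.

```python
# Minimal sketch of a job represented as a graph of dependent tasks.
# Names are illustrative; this is not Decima's actual representation.

job_graph = {
    # task: set of upstream tasks that must finish before it can start
    "map_a": set(),
    "map_b": set(),
    "aggregate": {"map_a", "map_b"},
    "report": {"aggregate"},
}

def runnable(finished):
    """Return the tasks whose upstream dependencies have all completed."""
    return [task for task, deps in job_graph.items()
            if task not in finished and deps <= finished]

# The scheduler repeatedly picks one of the runnable tasks; which task it
# picks, and where it places it, is what the RL policy learns.
finished = set()
while len(finished) < len(job_graph):
    task = runnable(finished)[0]  # a learned policy would choose here
    finished.add(task)
    print("running", task)
```

The real problem layers resource limits, parallelism, and thousands of such graphs arriving at once on top of this structure, which is exactly the combinatorial space the researchers say is too large for humans to tune by hand.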