<h4><strong>Why GPUs are better for Deep Learning</strong> </h4><p><a href="https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning" rel="noopener noreferrer" target="_blank">https://www.quora.com/Why-are-GPUs-well-suited-to-deep-learning</a></p><p><br></p><p>As many have said, GPUs are so fast because they are so efficient for matrix multiplication and convolution, but nobody gave a real explanation for why this is so. The real reason for this is memory bandwidth and not necessarily parallelism.</p><p>First of all, you have to understand that CPUs are latency optimized while GPUs are bandwidth optimized. You can visualize this as a CPU being a Ferrari and a GPU being a big truck. The task of both is to pick up packages from a random location A and to transport those packages to another random location B. The CPU (Ferrari) can fetch some memory (packages) in your RAM quickly while the GPU (big truck) is slower in doing that (much higher latency). However, the CPU (Ferrari) needs to go back and forth many times to do its job (location A → pick up 2 packages → location B ... repeat) while the GPU can fetch much more memory at once (location A → pick up 100 packages → location B ... repeat).</p><p>So, in other words, the CPU is good at fetching small amounts of memory quickly (5 * 3 * 7) while the GPU is good at fetching large amounts of memory (matrix multiplication: (A*B)*C). The best CPUs have about 50GB/s while the best GPUs have 750GB/s memory bandwidth. So the more memory your computational operations require, the more significant the advantage of GPUs over CPUs. But there is still the latency that may hurt performance in the case of the GPU. A big truck may be able to pick up a lot of packages with each tour, but the problem is that you are waiting a long time until the next set of packages arrives. Without solving this problem, GPUs would be very slow even for large amounts of data. 
So how is this solved?</p><p>If you ask a big truck to make many tours to fetch packages, you will always wait a long time for the next load of packages once the truck has departed on its next tour. The truck is just slow. However, if you now use a fleet of either Ferraris or big trucks (thread parallelism), and you have a big job with many packages (large chunks of memory such as matrices), then you will wait a bit for the first truck, but after that you will have no waiting time at all: unloading the packages takes so much time that all the trucks will queue at unloading location B, so you always have direct access to your packages (memory). This effectively hides latency, so GPUs offer high bandwidth while hiding their latency under thread parallelism. For large chunks of memory, GPUs therefore provide the best memory bandwidth with almost no drawback due to latency. This is the second reason why GPUs are faster than CPUs for deep learning. As a side note, you will also see why more threads do not make sense for CPUs: A fleet of Ferraris has no real benefit in any scenario.</p><p>But the advantages for the GPU do not end here. Everything described so far is the first step, where memory is fetched from the main memory (RAM) to the local memory on the chip (L1 cache and registers). The second step, working out of that local memory, is less critical for performance but still adds to the lead for GPUs. All computation that is ever executed happens in registers, which are directly attached to the execution unit (a core for CPUs, a stream processor for GPUs). Usually, you have the fast L1 and register memory very close to the execution engine, and you want to keep these memories small so that access is fast. Increased distance to the execution engine dramatically reduces memory access speed, so the farther away the memory is, the slower it is to access. 
If you make your memory larger and larger, then, in turn, it gets slower to access (on average, finding what you want to buy in a small store is faster than finding what you want to buy in a huge store, even if you know where that item is). So the size of register files is limited. We are at the limits of physics here and every nanometer counts, so we want to keep them small.</p><p>The advantage of the GPU here is that it can have a small pack of registers for every processing unit (stream processor, or SM), of which it has many. Thus we can have a lot of register memory in total, while each register file stays very small and thus very fast. This leads to the aggregate GPU register size being more than 30 times larger than that of CPUs and still twice as fast, which translates to up to 14MB of register memory that operates at a whopping 80TB/s. As a comparison, the CPU L1 cache only operates at about 5TB/s, which is quite slow, and has a size of roughly 1MB; CPU registers usually have sizes of around 64-128KB and operate at 10-20TB/s. Of course, this comparison of numbers is a bit flawed because CPU registers operate a bit differently than GPU registers (a bit like apples and oranges), but the difference in size here is more crucial than the difference in speed, and it does make a difference.</p><p>As a side note, full register utilization in GPUs seems difficult to achieve at first, because the register is the smallest unit of computation and needs to be fine-tuned by hand for good performance. However, NVIDIA has developed helpful compiler tools which indicate when you are using too many or too few registers per stream processor. It is easy to tweak your GPU code to make use of the right amount of registers and L1 cache for fast performance. 
This gives GPUs an advantage over other architectures like Xeon Phis, where this utilization is complicated to achieve and painful to debug, which in the end makes it difficult to maximize performance on a Xeon Phi.</p><p>What this means, in the end, is that you can store a lot of data in your L1 caches and register files on GPUs to reuse convolutional and matrix multiplication tiles. For example, the best matrix multiplication algorithms use 2 tiles of 64x32 to 96x64 numbers for 2 matrices in L1 cache, and a 16x16 to 32x32 number register tile for the output sums per thread block (1 thread block = up to 1024 threads; you have 8 thread blocks per stream processor, and there are 60 stream processors in total for the entire GPU). If you have a 100MB matrix, you can split it up into smaller matrices that fit into your cache and registers, and then do matrix multiplication with three matrix tiles at speeds of 10-80TB/s. That is fast! This is the third reason why GPUs are so much faster than CPUs, and why they are so well suited for deep learning.</p><p>Keep in mind that the slower memory always dominates performance bottlenecks. If 95% of your memory movements take place in registers (80TB/s), and 5% in your main memory (0.75TB/s), then you still spend most of the time on memory access of main memory (about six times as much).</p><p>Thus, in order of importance: (1) high bandwidth main memory, (2) hiding memory access latency under thread parallelism, and (3) large and fast register and L1 memory which is easily programmable are the components which make GPUs so well suited for deep learning.</p><p>However, building your own deep learning rig is a pricey affair. Factor in the costs of a fast and powerful GPU, CPU, SSD, compatible motherboard and power supply, air-conditioning bills, maintenance and damage to components. 
On top of that, you run the risk of falling behind on the latest hardware in this rapidly advancing industry.</p><p>Moreover, just assembling the components is not enough. You need to set up all the required libraries and compatible drivers before you can start training your first model. <a href="https://medium.com/towards-data-science/building-your-own-deep-learning-box-47b918aea1eb" rel="noopener noreferrer" target="_blank">People</a> still go along this route, and if you plan to use deep learning extensively (&gt;150 hrs/mo), <a href="https://medium.com/towards-data-science/build-a-deep-learning-rig-for-800-4434e21a424f" rel="noopener noreferrer" target="_blank">building your own deep learning workstation</a> might be the right move.</p><p>A better and cheaper alternative is to use cloud-based GPU servers provided by the likes of Amazon, Google, Microsoft and others, especially if you are just breaking into this domain and plan to use the computing power for learning and experimenting. I have been using AWS, Paperspace and FloydHub for the past 4–5 months. Google Cloud Platform and Microsoft Azure were similar to AWS in their pricing and offerings, so I stuck with the three mentioned above.</p><p><br></p><p>However, the reason you WANT to use cloud GPUs is the GPU architecture:</p><p><img src="https://cdn-images-1.medium.com/max/1200/1*kG_TBX339Kv-s4QtJuDpFg.png"></p><p>Source - <a href="https://www.nvidia.com/en-us/data-center/tesla-v100/" rel="noopener noreferrer" target="_blank">https://www.nvidia.com/en-us/data-center/tesla-v100/</a></p><p><br></p><h4><strong>Now that we know we want to use GPUs - Sign up and Create a Gradient Notebook with Keras and TensorFlow pre-installed</strong></h4><p><br></p><p>In my opinion, <strong>Paperspace's </strong>Gradient containers are some of the quickest and simplest ways to get started using a cloud GPU. 
In a matter of minutes you'll be training a CNN!</p><p>Of course, you're free to use other cloud GPU services: AWS, Azure, FloydHub, Google Cloud (GPUs and TPUs), Vast.AI and many others.</p><p>Here's an entire list of cloud GPU providers - <a href="https://towardsdatascience.com/list-of-deep-learning-cloud-service-providers-579f2c769ed6" rel="noopener noreferrer" target="_blank">https://towardsdatascience.com/list-of-deep-learning-cloud-service-providers-579f2c769ed6</a></p><p><br></p><h4><strong>Using Paperspace's GPU and Gradient Notebooks</strong></h4><p><strong>Step 1</strong></p><p>Go to <a href="http://www.paperspace.com" rel="noopener noreferrer" target="_blank">www.paperspace.com</a></p><p><strong>Step 2</strong></p><p>Sign up using your GitHub account or a regular email and password:</p><figure><img height="548" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-01-28-2731b9ed9c55a05ef3173c51e167b280.jpg" width="772"></figure><p><br></p><p><strong>Step 3A - Adding a Credit Card and using my Referral Code to get $5 Free Credit</strong></p><p>Verify your email and sign in to Paperspace, and you'll be greeted by this page. 
It may not be immediately apparent, but notice the <strong>Error Alert box in pink below (see the yellow box I drew over it)</strong>.</p><figure><img height="669" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-20-58-1f3fa9e8e3520ee03e0b3c210236ca08.jpg" width="776"></figure><p><br></p><p><strong>Step 3B - Go to the Billing Page</strong></p><figure><img height="664" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-24-18-8946a909736e325b2a3ea3d1bd664c59.jpg" width="770"></figure><p><br></p><p><strong>Step 3C</strong> - Enter Credit Card Info and Referral Code </p><h4><strong>Get $5 Free Credit by using Referral Code - 2DSSNCI</strong></h4><p><strong>or click below</strong></p><p><a href="https://paperspace.io/&amp;R=2DSSNCI" rel="noopener noreferrer" target="_blank">https://paperspace.io/&amp;R=2DSSNCI</a></p><p>This gives you ~10 hours of usage on a P4000 GPU.</p><p>After that, you must enter your credit card information; otherwise you won't be able to launch a Gradient machine.<img height="711" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-25-51-fbc69be7fa153b677f79199325f8bc3a.jpg" width="813"></p><p><br></p><p><strong>Step 4A - Creating a Gradient Notebook</strong></p><p>Go back to the <strong>Home </strong>landing page and click on the Gradient box shown below.</p><figure><img height="704" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-41-53-11d27f6f8a2ce11310545188c0281cfc.jpg" width="796"></figure><p><br></p><p><strong>Step 4B</strong> - Click on the box in yellow to create your first notebook</p><figure><img height="716" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-41-53-0c7321f58c7d5390e46b0bd13f3d1130.jpg" width="816"></figure><p><strong>Step 4C</strong> - Under the <strong>Public Containers Tab</strong> (should be shown by default), click on the <strong>ufoym/deepo:all-py36-jupyter</strong> container box (shown in the yellow box 
below).</p><p>This container includes many essential deep learning libraries, including Keras and TensorFlow. The main libraries included are:</p><ul><li><p>darknet latest (git)</p></li><li><p>python 3.6 (apt)</p></li><li><p>torch latest (git)</p></li><li><p>chainer latest (pip)</p></li><li><p>jupyter latest (pip)</p></li><li><p>mxnet latest (pip)</p></li><li><p>onnx latest (pip)</p></li><li><p>pytorch latest (pip)</p></li><li><p>tensorflow latest (pip)</p></li><li><p>theano latest (git)</p></li><li><p>keras latest (pip)</p></li><li><p>lasagne latest (git)</p></li><li><p>opencv 4.0.1 (git)</p></li><li><p>sonnet latest (pip)</p></li><li><p>caffe latest (git)</p></li><li><p>cntk latest (pip)</p></li></ul><figure><img height="466" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-17_04-19-17-440063cb83a37d599dcbc4f53114dc70.jpg" width="762"></figure><p><strong>Step 4D </strong>- Select the following instance, the <strong>P4000</strong>. Feel free to select other, more expensive GPUs, but in terms of value for cost, the P4000 is excellent. </p><figure><img height="727" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_22-22-15-46c43d464f8855522ce902c70489b44b.jpg" width="767"></figure><p><strong>Step 4E</strong> - Name your notebook (or keep the default name, it's your choice). I chose <em>"DL CV P4000"</em>.</p><p><strong>Next </strong>you should see the Create Notebook button below in green; click it to create your notebook!</p><p><strong>Note</strong>: You can ignore the 04. 
Auto-Shutdown area for now, but remember to shut down your notebook after use (I will show you how soon) so that your credit does not run out.</p><figure><img height="690" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_21-41-53-180261afb3d050de2477dee1716e0e84.jpg" width="748"></figure><p><strong><br>Step 5 - Launching your notebook</strong></p><p>Your notebook will now show up in the Notebook section shown below.</p><p>Click the greyed-out <strong>Start </strong>button (located in the Actions column) to boot it up.</p><figure><img height="509" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_22-30-22-6d05506ee36de6474990c673cde31048.jpg" width="764"></figure><p>This window will now appear; click <strong>Start Notebook </strong>below. </p><figure><img height="812" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-16_22-30-22-b6600eb8d9457dce80b1655c85ed081d.jpg" width="796"></figure><p>Status will change from <strong>Stopped </strong>to <strong>Pending </strong>-&gt; <strong>Provisioned </strong>-&gt; <strong>Running</strong></p><p>Once it's running, you'll be able to launch it by pressing <strong>Open</strong></p><figure><img height="262" src="https://udemy-images.s3.amazonaws.com:443/redactor/raw/2019-04-17_04-24-00-a14237a295cc12888557706d1ec983e7.jpg" width="766"></figure>
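<p>Once the notebook is open, a first cell can double as a quick check on the bandwidth argument from the opening section. The sketch below is plain Python and needs no GPU. It only reuses figures quoted there (50GB/s and 750GB/s main memory bandwidth, 80TB/s registers, the 95%/5% split). The function names and the simple size-divided-by-bandwidth model are illustrative assumptions, not measurements, and the model ignores latency and caching.</p>

```python
# Figures quoted in the article; treat them as rough, era-specific numbers.
CPU_MAIN_MEMORY_GBPS = 50.0    # "best CPUs have about 50GB/s"
GPU_MAIN_MEMORY_GBPS = 750.0   # "best GPUs have 750GB/s"
GPU_REGISTER_TBPS = 80.0       # aggregate GPU register bandwidth
GPU_MAIN_MEMORY_TBPS = 0.75    # the same 750GB/s expressed in TB/s


def transfer_ms(size_mb, bandwidth_gbps):
    """Idealized time in milliseconds to stream size_mb megabytes once.

    Assumption: 1 GB = 1000 MB, and latency is ignored (the article argues
    it is hidden by thread parallelism for large transfers).
    """
    seconds = size_mb / (bandwidth_gbps * 1000.0)
    return seconds * 1000.0


def slow_to_fast_time_ratio(fast_fraction, fast_bw, slow_bw):
    """Time spent in the slow memory divided by time spent in the fast one.

    Both bandwidths must use the same unit; only their ratio matters.
    """
    slow_fraction = 1.0 - fast_fraction
    return (slow_fraction / slow_bw) / (fast_fraction / fast_bw)


# Streaming the 100MB matrix from the article once from main memory.
cpu_ms = transfer_ms(100, CPU_MAIN_MEMORY_GBPS)
gpu_ms = transfer_ms(100, GPU_MAIN_MEMORY_GBPS)

# 95% of memory movement in registers, 5% in main memory.
ratio = slow_to_fast_time_ratio(0.95, GPU_REGISTER_TBPS, GPU_MAIN_MEMORY_TBPS)

print(f"100MB matrix: CPU {cpu_ms:.2f} ms, GPU {gpu_ms:.2f} ms")
print(f"Main memory time / register time: {ratio:.1f}")
```

<p>With these inputs the ratio comes out near 5.6, in line with the "about six times as much" quoted earlier, and the 100MB matrix streams 15 times faster at the quoted GPU bandwidth than at the quoted CPU bandwidth.</p>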