My first experience working with VM instances on Google Cloud
Today, I will describe in detail what I faced while running my code on a VM instance. First, I should mention that this was my first time working with a VM instance on Google Cloud.
For my course, Advanced Machine Learning (CSCI-6352), my course teacher, Dr. Dong-Chul Kim, encouraged us to work with Google Cloud. His lecture gave me an initial idea of VM instances, and he demonstrated the first part of setting one up.
After that, I tried it myself and ran into some problems. My coursemate, Ashraful Islam, helped me overcome some of them. So let’s see what I faced from the beginning.
1st attempt: I followed my Professor's lecture. First, I went to console.cloud.google.com and, from the left-side menu, opened Compute Engine > VM instances. Then I picked Create Instance, gave the instance a name, and chose these settings:
- Region: asia-southeast1 (Singapore), zone: asia-southeast1-b
- Series: N1 (powered by Intel Skylake CPU platform or one of its predecessors)
- Machine type: n1-standard-1 (1 vCPU, 3.75 GB memory)
- Boot disk: operating system Ubuntu, version Ubuntu 18.04 LTS, boot disk type Balanced persistent disk, size 100 GB
- Firewall: allow HTTP and HTTPS traffic
- GPUs: T4, number: 1 (you have to send a request to Google Cloud to increase the GPU limit)
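For readers who prefer the command line, the same choices can be written as a gcloud command. The sketch below only assembles and prints the arguments and does not run anything. The instance name "my-ml-vm" is hypothetical, and the image family, the TERMINATE maintenance policy (which GPU instances need) and the network tags are assumptions about how the console choices map to flags.

```python
def build_create_command(name):
    """Return a gcloud argument list that mirrors the console choices above."""
    return [
        "gcloud", "compute", "instances", "create", name,
        "--zone=asia-southeast1-b",
        "--machine-type=n1-standard-1",            # 1 vCPU, 3.75 GB memory
        "--image-family=ubuntu-1804-lts",          # assumed family for Ubuntu 18.04 LTS
        "--image-project=ubuntu-os-cloud",
        "--boot-disk-type=pd-balanced",            # Balanced persistent disk
        "--boot-disk-size=100GB",
        "--accelerator=type=nvidia-tesla-t4,count=1",
        "--maintenance-policy=TERMINATE",          # assumed: required with a GPU attached
        "--tags=http-server,https-server",         # assumed tags for the HTTP/HTTPS boxes
    ]


# "my-ml-vm" is a hypothetical instance name used only for illustration.
command = build_create_command("my-ml-vm")
print(" ".join(command))
```

The printed line can be pasted into Cloud Shell after checking each flag against the current gcloud reference.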
Then I set up my environment. I used Jupyter Notebook. I didn’t use any container.
Then I created the instance. Now I will point out the mistakes I made with the first setup.
- I couldn’t run my code because the data set was 8 GB and I had picked a machine with only 3.75 GB of memory, so after 1 epoch the server shut down!
- I got this error: “The kernel appears to have died. It will restart automatically.”
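A quick back-of-the-envelope check before creating the instance would have caught this. The sketch below assumes the whole data set is loaded into memory at once, and the overhead factor of 1.5 is a made-up safety margin for illustration, not a measured value.

```python
def fits_in_memory(dataset_gb, ram_gb, overhead_factor=1.5):
    """Rough check: the data set plus working overhead must fit in RAM.

    overhead_factor is an assumed margin for copies, batches and the
    model itself; tune it for the actual workload.
    """
    return dataset_gb * overhead_factor <= ram_gb


# Compare the two memory sizes mentioned in this post for an 8 GB data set.
for ram_gb in (3.75, 15.0):
    verdict = "may fit" if fits_in_memory(8.0, ram_gb) else "will not fit"
    print(f"{ram_gb} GB RAM: 8 GB data set {verdict}")
```

If the data set cannot fit, the alternatives are a bigger machine type or loading the data in batches.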
How did I solve it?
I tried to get help from StackOverflow, but without success. Then I searched on Google and found a Medium blog post by Mr. Cedric Chee (https://medium.com/@cedric_chee/just-sharing-in-case-anyone-run-into-the-same-problem-775b92bf660a). After reading it, I changed the memory to 15 GB. I didn’t know that I could edit the memory of an existing instance; I just deleted it and created a new one.
This time, I could run my code for up to 10 epochs, and then I faced some new issues:
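Deleting the instance is not strictly necessary, because the machine type of a stopped instance can be changed. The sketch below builds the three gcloud commands for that without running them. The instance name is hypothetical, and n1-standard-4 (4 vCPUs, 15 GB memory) is an assumption about which machine type gives the 15 GB mentioned above.

```python
def build_resize_commands(name, zone, machine_type):
    """Return the stop / set-machine-type / start commands as argument lists."""
    base = ["gcloud", "compute", "instances"]
    zone_flag = f"--zone={zone}"
    return [
        base + ["stop", name, zone_flag],   # the instance must be stopped first
        base + ["set-machine-type", name, zone_flag,
                f"--machine-type={machine_type}"],
        base + ["start", name, zone_flag],
    ]


# "my-ml-vm" is a hypothetical name; n1-standard-4 is an assumed 15 GB type.
for command in build_resize_commands("my-ml-vm", "asia-southeast1-b", "n1-standard-4"):
    print(" ".join(command))
```

The boot disk and everything installed on it survive this change, which is the main advantage over deleting and recreating.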
“Connection Failed: We are unable to connect to the VM on port 22. Learn more about possible causes of this issue”
“ssh: connect to host 000.000.18.100 port 22: Connection refused
ERROR: (gcloud.compute.ssh) [/usr/bin/ssh] exited with return code .”
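A "Connection refused" on port 22 can be checked from any machine with a few lines of Python, which helps separate a firewall or SSH-daemon problem from a problem in the browser console. This is a minimal sketch; the function name is my own, and the demo connects to a throwaway local listener so that no real server is contacted.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Demo against a local listener; for a real VM, pass its external IP and 22.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
        listener.listen(1)
        host, port = listener.getsockname()
        print(f"port {port} open:", port_is_open(host, port))
```

If the check fails for port 22 but the VM is running, the firewall rules are a reasonable next thing to inspect.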
By this time, I had already spent 32 hours working on it! I couldn’t trace what had happened. Then I reached out to my friend Ashikur Rahman Ashik, a server expert who has been working with me since 2014 as a soulmate. I didn’t know that he also works with Google Cloud.
He checked everything and noticed that I had some problems with the firewall. He fixed them, and I ran my code again, but I faced the same issue. Then we decided to delete this instance.
This time, I noticed that he added firewall rules in the Protocols and ports section: tcp: 8888 for one rule and tcp: 22 for the other, and in both cases Source IP ranges: 0.0.0.0/0. So he added two firewall rules under two different names, each with its own protocol and port.
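To see what these two rules actually permit, they can be modelled with Python's standard ipaddress module. This is only an illustration of how a source range and a port combine, not how Google Cloud evaluates rules internally. The rule names are hypothetical; the ports and the source range are the ones described above.

```python
import ipaddress

# Hypothetical rule names; ports and source range are taken from the text.
RULES = [
    {"name": "allow-jupyter", "protocol": "tcp", "port": 8888, "source": "0.0.0.0/0"},
    {"name": "allow-ssh", "protocol": "tcp", "port": 22, "source": "0.0.0.0/0"},
]


def is_allowed(rules, client_ip, port, protocol="tcp"):
    """Return True if any rule admits this client IP on this protocol and port."""
    address = ipaddress.ip_address(client_ip)
    return any(
        rule["protocol"] == protocol
        and rule["port"] == port
        and address in ipaddress.ip_network(rule["source"])
        for rule in rules
    )


# 198.51.100.7 is a documentation-only example address.
for port in (22, 8888, 80):
    print(port, is_allowed(RULES, "198.51.100.7", port))
```

Because 0.0.0.0/0 matches every IPv4 address, both ports are reachable from anywhere; replacing it with a narrower range limits who can connect.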
Problems that we faced:
At installation time, we noticed that we couldn’t set up TensorFlow version 2.6.0 on Ubuntu 18.04. I tried to install it by specifying the version manually, but that failed as well. Then we decided to delete this instance too.
This time, we used Ubuntu 20.04 LTS, and my friend Ashik recommended using an SSD persistent disk.
So we set up our environment and installed everything. This time, to get the data onto the VM, we used the Kaggle API, so I didn’t have to upload a zip file from my PC.
About 24 hours later, I successfully ran my whole code and got the output.
And this is the story of my first implementation on a VM instance with Google Cloud.
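A possible cause, not verified here, is an outdated pip: TensorFlow's install guide asks for pip 19.0 or newer, and the pip packaged with Ubuntu 18.04 is, as far as I know, much older (9.0.1). The sketch below shows a simple version comparison that could be run before installing. The helper names are my own, and it only handles plain numeric versions such as "9.0.1".

```python
def parse_version(text):
    """Turn a purely numeric version string like '9.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Assumed values: "9.0.1" for Ubuntu 18.04's packaged pip, "19.0" as the
# minimum from TensorFlow's install guide.
for pip_version in ("9.0.1", "20.0.2"):
    status = "ok" if meets_minimum(pip_version, "19.0") else "too old"
    print(f"pip {pip_version}: {status}")
```

If pip turns out to be too old, upgrading it inside a virtual environment is usually less disruptive than rebuilding the instance.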
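Downloading through the Kaggle command-line tool looks roughly like the sketch below. It only builds and prints the command. It assumes the kaggle package is installed, an API token is saved as ~/.kaggle/kaggle.json, and the data is a Kaggle dataset rather than a competition. The slug "owner/dataset-name" is a placeholder, not the data set used in this post.

```python
def build_kaggle_download(dataset_slug, destination="data"):
    """Return the kaggle CLI arguments to download and unzip one dataset."""
    return [
        "kaggle", "datasets", "download",
        "-d", dataset_slug,    # dataset in "owner/dataset-name" form
        "-p", destination,     # folder on the VM to download into
        "--unzip",             # extract the archive after downloading
    ]


# "owner/dataset-name" is a placeholder slug used only for illustration.
command = build_kaggle_download("owner/dataset-name")
print(" ".join(command))
# To actually run it on the VM: subprocess.run(command, check=True)
```

Pulling the data straight onto the VM this way avoids pushing several gigabytes through a home connection.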