April 12, 2019

How to set up and work with Google DataLab

Step-by-step instructions on how to start a Google Datalab project #

Starting a Google Datalab project takes several steps, and you are well advised to set aside an hour to get through the process and understand its mechanics. At the end of that session, you will have a Python-based, Jupyter-style notebook that looks like Google Colaboratory and runs on Google's compute clusters, but one that you control from your own computer with terminal commands. There is a learning curve, and several steps need to be negotiated one by one. The quickstart guide that Google offers is a good place to start, but not everything that happens is documented there, so this is more or less a novice's guide to setting up the process.

Step 1: #

Think of a project name and an instance name. The project name can be anything, and you set it up when you sign up for Google Cloud Platform. The project name is associated with a project ID, and it is the project ID that you need to get started.
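
Because the name and the ID are easy to mix up, here is a minimal sketch of a check on the shape of a string before you use it as a project ID. The function `is_valid_project_id` is a hypothetical helper of mine, not part of gcloud, and `genome-analysis-237410` is a made-up ID for illustration. It assumes the usual project ID format: 6 to 30 characters, lowercase letters, digits and hyphens, starting with a letter and not ending with a hyphen.

```shell
# Hypothetical helper (not part of gcloud): succeeds if the argument has the
# shape of a project ID rather than a free-form project name.
is_valid_project_id() {
  # 6-30 characters: a lowercase letter first, then lowercase letters, digits
  # or hyphens, and a letter or digit last.
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# The first candidate is a made-up ID, the second a typical display name.
for candidate in "genome-analysis-237410" "Genome Analysis"; do
  if is_valid_project_id "$candidate"; then
    echo "$candidate: looks like a project ID"
  else
    echo "$candidate: looks like a project name, not an ID"
  fi
done
```

This only checks the format; it cannot tell you whether the project actually exists in your account.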

Step 2: #

You need to enable billing for the project and enable the APIs that Datalab uses; the quickstart guide lists them.

Step 3: #

You need to use the gcloud command, and therefore need a terminal that can take ssh or gcloud commands. For Windows that would be PuTTY, and for Mac and Linux, your favourite terminal. I use Z shell, and that works well for me. I set it up on a Linux box, so these instructions were executed in a Linux terminal, but the steps are the same for any operating system. If you use a Chromebook, or if, like me, you cannot use Windows with Administrator privileges, then you will need to use Google Cloud Shell. Cloud Shell seems to be good only for short periods of work with Datalab, so use a local terminal if you intend to keep the session open for longer while you are learning or running long-ish analyses (as I tend to do).

Step 4: Install and initialise the Cloud SDK #

This is relatively straightforward. You will:

Download the zipped archive from the Cloud SDK website
Run the installer and remember to initialise the SDK (gcloud init)
Instructions here
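
Before moving on, it helps to confirm that the terminal can actually find gcloud. The sketch below is only a convenience check; `check_gcloud` is a hypothetical function name of mine, and it relies on nothing beyond `command -v` and `gcloud --version`.

```shell
# Report whether the Cloud SDK's gcloud command is on the PATH.
check_gcloud() {
  if command -v gcloud >/dev/null 2>&1; then
    echo "gcloud found at: $(command -v gcloud)"
    gcloud --version   # prints the installed SDK component versions
  else
    echo "gcloud not found; install the Cloud SDK and then run 'gcloud init'"
  fi
}

check_gcloud
```

If the command is not found right after installing, opening a new terminal so that the updated PATH is picked up is usually the first thing to try.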

Step 5: Now the series of instructions #

(Assuming that you have installed Cloud SDK, and that you use a terminal), do:

$ gcloud components update # this will update the gcloud components
$ gcloud components install datalab # this will install datalab
# then, if you are doing this for the first time, you will need to log in
$ gcloud auth login # this takes you to the familiar Google login window in the browser
# the first time, you also need to specify the project you are working on
# if you visit the Google Cloud management console (the one that opens after login)
# you will find your project ID beneath your project name
# use the ID of the project here, not its name
$ gcloud config set project project-id
# after this gets successfully completed, do
$ datalab create datalab-project-name
# where 'datalab-project-name' is the name you want to give your Datalab instance
# (this is the instance name from Step 1, not the Cloud project name)
# For example, I named mine 'genome-analysis' because that's what I want to do
# You may want to stop and restart this instance between sessions
# to save $$$ (otherwise it can get quite expensive depending on what you do)
# so in my case it was: datalab create genome-analysis (no quote marks anywhere)
# When you do this, it will ask you to select a zone
# Select the zone geographically closest to you; I selected an Australian zone as I am in NZ
# You will see it create several items: 'network', 'repository', 'instance', etc.
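
Since stopping and restarting the instance is what keeps the bill down, here is a minimal sketch of that routine. The `run` wrapper and the `INSTANCE` and `DRY_RUN` variables are my own scaffolding, not part of the datalab tool; with the default `DRY_RUN=1` the script only prints what it would do. My understanding is that `datalab connect` restarts a stopped instance before reopening the tunnel, so treat that comment as an assumption and check it against your own session.

```shell
# Instance name as passed to 'datalab create'; 'genome-analysis' is my example.
INSTANCE="${INSTANCE:-genome-analysis}"
# DRY_RUN=1 (the default here) prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run datalab list                 # show your Datalab instances and their status
run datalab stop "$INSTANCE"     # stop the VM when you finish for the day
run datalab connect "$INSTANCE"  # reopen the tunnel next time (assumed to restart a stopped VM)
```

A stopped instance no longer bills for compute time, but its disk still exists and is still charged for, so delete the instance once you are completely done with it.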

At that point, the programme will create a network tunnel and ask you to create an RSA key pair. It is a good idea to do so if you will be working with real data. The prompt tells you that you can manage your keys at:

https://cloud.google.com/compute/docs/instances/adding-removing-ssh-keys

At this point, you choose a passphrase. I tend to use simple passphrases. At the terminal you will see a key fingerprint that looks something like: SHA256:some-random-string-of-digits-letters @machine-name
The public key is kept at ~/.ssh/google_compute_engine.pub
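
If you want to see that fingerprint again later, ssh-keygen can print it from the public key file. This is a small sketch; `show_fingerprint` and the `KEY` variable are hypothetical names of mine, and the path is the one mentioned above.

```shell
# Path of the public key mentioned above; override KEY to point elsewhere.
KEY="${KEY:-$HOME/.ssh/google_compute_engine.pub}"

# Print the fingerprint of a public key file, or say that the file is missing.
show_fingerprint() {
  if [ -f "$1" ]; then
    ssh-keygen -lf "$1"   # -l lists the fingerprint, -f names the key file
  else
    echo "no public key at $1 yet"
  fi
}

show_fingerprint "$KEY"
```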

At this stage, it will tell you that the connection to Datalab is now open and will remain open until the command is killed. This means that if you need a terminal for anything else, you must open another one, or keep this one running in another ‘window’ or ‘panel’.
Now, if you connect to:

http://localhost:8081/

you will see a minimalistic window with Jupyter notebooks and a readme.md file. There is a powerful engine behind that basic appearance. You can download a notebook as HTML or as a Python file; the notebook itself is saved as an .ipynb (Jupyter notebook) file. You can host the notebook elsewhere, and you can use Markdown to write text. It is pretty barebones, but enough to get you going with Python.
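
If the page does not load, a quick check from a second terminal tells you whether anything is listening on that port at all. This is only a sketch: `check_datalab` and the `URL` variable are hypothetical names of mine, and it assumes curl is installed.

```shell
# Local address of the tunnel; override URL if yours uses another port.
URL="${URL:-http://localhost:8081/}"

# Report whether anything answers at the given URL within five seconds.
check_datalab() {
  if curl -s -o /dev/null --max-time 5 "$1"; then
    echo "Datalab is answering at $1"
  else
    echo "nothing is answering at $1; is the datalab command still running?"
  fi
}

check_datalab "$URL"
```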

If you do not want to go through these steps, you can use Google Colaboratory.
