Exploring Databricks Community Edition — Part 1
Recently, I passed the Databricks Lakehouse Fundamentals certification exam, which was a really nice conceptual introduction to the Databricks Lakehouse. It’s completely free on Databricks Academy and consists of a series of short videos you can watch to prepare for the 25 multiple-choice questions. I also wanted to get some hands-on experience with the Databricks platform, so in this article I walk through the Databricks Community Edition setup process, along with a brief overview of how Python and SQL are used in Databricks notebooks.
I am still learning Databricks, but I really enjoyed this free playground edition and wanted to share what I picked up from the experience. Also, if you’ve used Jupyter Notebook before, you’ll notice that Databricks notebooks feel quite similar.
Set Up Databricks Community Edition
You can sign up for the free Community Edition below. The signup flow provides easy step-by-step instructions. You’ll see options to select a cloud provider, such as AWS, but you don’t need to worry about that; just continue with the Community Edition.
A Glimpse into ‘Create’ in the Databricks Platform
After creating a Databricks Community Edition account, you can create a notebook, table, or cluster, each of which I’ll cover below.
Create a ‘Notebook’ in the Databricks Platform
When creating a Databricks notebook, there’s an option to choose the default programming language, which is really awesome. In the example below, I also renamed the Databricks notebook to ‘Databricks Notebook — Data Exercise’.
Here, I chose Python, and just like you would in Jupyter Notebook, you can write a simple Python ‘Hello World’ script and run the cell to see the output. A first cell can be as minimal as this:
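```python
# A simple first cell to confirm the notebook runs Python
print("Hello World")
```

However, when you try to run the cell, Databricks asks you to choose a cluster, and we haven’t created one yet. Below, we’ll create a ‘Cluster,’ which is required to run notebooks and automation jobs.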
Create a ‘Cluster’ in Databricks Community Edition
When you click ‘Create’ in the left-hand navigation menu, you can also choose ‘Cluster,’ and the screenshot below shows what cluster creation looks like. It seems that Community Edition members get 15 GB of free memory for the new compute, which is plenty for simple data exercises.
Feel free to rename the compute resource as well; in the example below, I called it ‘Cluster_Data_Practice’.
We can leave the ‘Databricks runtime version’ default option as is.
Click “Create compute”.
Run Python Scripts in a Databricks Notebook
Returning to the Databricks notebook, we can begin running some Python scripts after connecting to the cluster we created earlier.
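As a small sketch of the kind of cell you might run, the example below builds a tiny DataFrame; Databricks notebooks expose a preconfigured SparkSession named spark, and the sample rows here are made up:

```python
# 'spark' is the SparkSession that Databricks provides in every notebook.
# Build a tiny DataFrame to confirm the attached cluster is working.
data = [("Stranger Things", 2016), ("Dark", 2017), ("Ozark", 2017)]
df = spark.createDataFrame(data, ["title", "release_year"])

df.show()                        # render the rows in the cell output
print("row count:", df.count())  # runs a small Spark job on the cluster
```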
Create a ‘Table’ in the Databricks Platform: Query a Table in a Notebook Using SQL after File Upload
I learned that you can also connect to a table, or upload a data file as a table, and then use SQL in the Databricks notebook to query and manipulate the data, which is really handy for more complex transformations. In the ‘Create’ section of the left-hand menu bar, click on the ‘Table’ option:
Once you select ‘Table,’ you’ll see options to pull in a table from AWS S3 or other data sources (see below). For now, we’ll upload a fun Netflix movie dataset that I downloaded from Kaggle.
Once you’ve uploaded the data file, select ‘Create Table with UI,’ choose the cluster we created earlier, and then click ‘Preview Table.’
You will see all the columns from the file; if the column names don’t appear as they should, make sure ‘First row is header’ is selected.
You can also change the column data types in the ‘Table Preview’ and edit the ‘Table Name’ as well.
Finally, click on ‘Create Table’.
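Once the table exists, a quick sanity check from a Python cell confirms that the schema came through; note that ‘netflix_titles’ is a placeholder for whatever table name you chose:

```python
# Load the table registered by the Create Table UI.
# 'netflix_titles' is a placeholder; use the table name you chose.
df = spark.table("netflix_titles")
df.printSchema()      # verify the header row became column names
display(df.limit(5))  # 'display' renders an interactive table in Databricks
```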
Now we can see our new Netflix table and query it using SQL in our Databricks notebook.
In the Databricks notebook cell, make sure you select ‘SQL’ as the language before running your SQL script. When you type ‘SELECT * FROM’ and start entering the table name, autocomplete will suggest the table after the first few characters.
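If you’d rather stay in a Python cell, the same query can be run through spark.sql; again, the table name is a placeholder:

```python
# Equivalent of a SQL cell's 'SELECT * FROM ...' run from Python.
# 'netflix_titles' is a placeholder table name.
preview = spark.sql("SELECT * FROM netflix_titles LIMIT 10")
display(preview)
```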
Get Total Unique Movie Title Counts by Year from Netflix Table:
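A query along these lines produces that result; the column names (type, title, release_year) follow the common Kaggle Netflix schema and are an assumption about this particular file, so adjust them to match your upload:

```python
# Count unique movie titles per release year. The column names
# (type, title, release_year) assume the common Kaggle Netflix schema,
# and 'netflix_titles' is a placeholder table name.
counts_by_year = spark.sql("""
    SELECT release_year,
           COUNT(DISTINCT title) AS unique_titles
    FROM netflix_titles
    WHERE type = 'Movie'
    GROUP BY release_year
    ORDER BY release_year
""")
display(counts_by_year)
```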
What’s really cool is that you can visualize the counts as well. When you click the plus ‘+’ sign on the results table, there’s an option to add a ‘Visualization.’ Once you choose your chart type and publish it, you can toggle between the table and the visualization created from the query. That’s an awesome feature.
You can also publish your notebook, but be aware that published notebooks are public to everyone. Below is a link to my Data Exercise notebook from above.
And, that’s a wrap!
Really appreciate the time you took to read through this, and I hope it was helpful as you explore the Databricks Community Edition. I am still on a learning journey and am planning to write my next post about ETL and simple job automation scheduling using AWS S3. Below are some resources I think will be helpful as well.