Getting Started with Databricks

This guide will help you set up your Databricks account, create your first workspace, and run your first notebook.

Prerequisites

Before you begin, you'll need:

  • A cloud provider account (AWS, Azure, or Google Cloud)
  • A Databricks account (or sign up for Community Edition)
  • Basic knowledge of Python, SQL, or Scala
  • Understanding of big data concepts (helpful but not required)

Creating a Databricks Account

Option 1: Databricks Community Edition (Free)

The Community Edition is perfect for learning and experimentation:

  1. Visit https://community.cloud.databricks.com/
  2. Click "Sign Up"
  3. Enter your email and create a password
  4. Verify your email address
  5. You'll be automatically logged into your workspace

Limitations of Community Edition:

  • Single-node clusters only
  • Limited compute resources
  • Cannot use premium features
  • Data is not persistent beyond 30 days of inactivity

Option 2: Full Platform Trial

For production evaluation with all features:

AWS

  1. Visit https://databricks.com/try-databricks
  2. Select "AWS" as your cloud provider
  3. Sign up with your email
  4. Follow the setup wizard to create your workspace
  5. Link your AWS account (if required)

Azure

  1. Visit Azure Portal
  2. Search for "Azure Databricks"
  3. Click "Create" to create a new resource
  4. Fill in workspace details:
    • Subscription
    • Resource Group
    • Workspace Name
    • Region
    • Pricing Tier
  5. Click "Review + Create"

Google Cloud

  1. Visit Google Cloud Console
  2. Enable Databricks from the marketplace
  3. Follow the setup instructions
  4. Configure billing and permissions

Understanding the Workspace

Once logged in, you'll see the Databricks workspace interface with several key areas:

  • Workspace: Store notebooks, libraries, and folders
  • Repos: Git integration for version control
  • Data: Browse databases, tables, and files
  • Compute: Create and manage clusters
  • Workflows: Define and schedule jobs
  • Machine Learning: ML experiments and models

Top Navigation Bar

  • Search bar for quick access
  • Notifications
  • User settings
  • Help and documentation

Creating Your First Cluster

Clusters are compute resources that execute your code. Here's how to create one:

  1. Click "Compute" in the left sidebar
  2. Click "Create Cluster"
  3. Configure your cluster:
Cluster Name: my-first-cluster
Cluster Mode: Single Node (for learning) or Standard (for production)
Databricks Runtime Version: 13.3 LTS (latest LTS recommended)
Node Type:
  - Community Edition: Predefined
  - Full Platform: Choose based on workload (e.g., m5.large for AWS)
Autopilot Options:
  - Enable autoscaling: Yes
  - Min Workers: 2
  - Max Workers: 8
Terminate After: 120 minutes of inactivity
  1. Click "Create Cluster"
  2. Wait for the cluster to start (indicated by a green status)

Cluster Modes:

  • Single Node: For lightweight workloads and development
  • Standard: For distributed processing with multiple workers
  • High Concurrency: For sharing among multiple users with fine-grained resource allocation
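
If you would rather automate this, the same configuration can be submitted programmatically. A minimal sketch, assuming the Clusters API 2.0 create endpoint and illustrative environment variable names for the workspace URL and a personal access token:

# Sketch: create a cluster via the Clusters REST API (API 2.0 assumed)
# DATABRICKS_HOST and DATABRICKS_TOKEN are illustrative variable names
import os
import requests

host = os.environ["DATABRICKS_HOST"]      # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]    # a personal access token

cluster_spec = {
    "cluster_name": "my-first-cluster",
    "spark_version": "13.3.x-scala2.12",  # corresponds to Databricks Runtime 13.3 LTS
    "node_type_id": "m5.large",           # the AWS node type used as an example above
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,
}

resp = requests.post(
    f"{host}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {token}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("Created cluster:", resp.json()["cluster_id"])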

Creating Your First Notebook

Notebooks are interactive documents where you write and execute code:

  1. Click "Workspace" in the left sidebar
  2. Navigate to your user folder
  3. Click the dropdown arrow next to your folder
  4. Select "Create" > "Notebook"
  5. Configure your notebook:
    • Name: "My First Notebook"
    • Default Language: Python
    • Cluster: Select your cluster
  6. Click "Create"

Running Your First Code

Let's write some simple code to verify everything works:

Example 1: Hello World

# In your notebook cell, type:
print("Hello, Databricks!")

# Press Shift + Enter to run the cell

Output:

Hello, Databricks!

Example 2: Create a DataFrame

# Create a simple DataFrame
# The SparkSession is already available as `spark` in Databricks notebooks

data = [
("Alice", 34, "Data Scientist"),
("Bob", 28, "Data Engineer"),
("Charlie", 31, "ML Engineer")
]

df = spark.createDataFrame(data, ["name", "age", "role"])
display(df)

This will show a formatted table with your data.
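
If you want explicit control over column types instead of letting Spark infer them from the Python objects, you can pass a schema to createDataFrame. A minimal sketch, reusing the data list from above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define the column names and types explicitly
schema = StructType([
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=False),
    StructField("role", StringType(), nullable=False),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()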

Example 3: SQL Query

# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")

# In a new cell, use the %sql magic command (it must be the first line of the cell)
%sql
SELECT name, age, role
FROM employees
WHERE age > 30
ORDER BY age DESC
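
The same query can also be issued from a Python cell with spark.sql(), or expressed directly in the DataFrame API. A brief sketch of both, assuming the employees view created above:

# Run the SQL from Python
result = spark.sql("""
    SELECT name, age, role
    FROM employees
    WHERE age > 30
    ORDER BY age DESC
""")
display(result)

# Equivalent DataFrame API version
from pyspark.sql import functions as F
display(df.filter(F.col("age") > 30).orderBy(F.col("age").desc()))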

Example 4: Read Data from Cloud Storage

# AWS S3 example
df_s3 = spark.read.csv("s3a://your-bucket/data.csv", header=True, inferSchema=True)

# Azure ADLS example
df_adls = spark.read.csv(
    "abfss://container@account.dfs.core.windows.net/data.csv",
    header=True,
    inferSchema=True
)

# Display the data
display(df_s3.limit(10))
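
Reading from cloud storage usually requires credentials to be configured first. One common pattern for ADLS is to set the storage account key, ideally fetched from a secret scope, on the Spark session; the account, container, scope, and key names below are placeholders:

# Placeholder names: substitute your own storage account, secret scope, and key
storage_account = "account"
account_key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")

# Make the account key available to Spark for abfss:// paths
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    account_key,
)

df_adls = spark.read.csv(
    f"abfss://container@{storage_account}.dfs.core.windows.net/data.csv",
    header=True,
    inferSchema=True,
)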

Notebook Features

Magic Commands

Databricks notebooks support magic commands for different languages:

%python  # Python code (default)
%sql # SQL queries
%scala # Scala code
%r # R code
%md # Markdown for documentation
%sh # Shell commands
%fs # Databricks File System commands
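
For example, documentation and quick shell checks can sit alongside Python in the same notebook; each magic command must be the first line of its own cell:

%md
## Data loading
This section reads the raw CSV files and registers them as temporary views.

%sh
# Inspect the driver node's local filesystem
ls -la /tmp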

Visualization

Databricks provides built-in visualizations:

# Create sample data
import pandas as pd

sales_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'revenue': [10000, 12000, 15000, 13000, 17000, 20000]
})

df_sales = spark.createDataFrame(sales_data)
display(df_sales)

Click the chart icon above the results to create visualizations like:

  • Bar charts
  • Line charts
  • Pie charts
  • Scatter plots
  • Maps

Widgets for Parameters

Create interactive parameters in your notebook:

# Create a text widget
dbutils.widgets.text("user_name", "Guest", "Enter your name")

# Use the widget value
name = dbutils.widgets.get("user_name")
print(f"Hello, {name}!")

# Create a dropdown widget
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "prod"])

env = dbutils.widgets.get("environment")
print(f"Running in {env} environment")

Working with Data

Databricks File System (DBFS)

DBFS is a distributed file system that comes with Databricks:

# List files in DBFS
%fs ls /

# Upload a file through UI: Data > DBFS > Upload File

# Read a file from DBFS
df = spark.read.csv("/FileStore/tables/mydata.csv", header=True, inferSchema=True)
display(df)
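
The same file operations are available programmatically through dbutils.fs, which is useful inside functions and scheduled jobs where magic commands are not an option:

# List files and their sizes under /FileStore/tables
for f in dbutils.fs.ls("/FileStore/tables/"):
    print(f.path, f.size)

# Preview the first 500 bytes of a file
print(dbutils.fs.head("/FileStore/tables/mydata.csv", 500))

# Copy a file to another DBFS location
dbutils.fs.cp("/FileStore/tables/mydata.csv", "/tmp/mydata_copy.csv")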

Creating Tables

# Create a managed table
df.write.format("delta").mode("overwrite").saveAsTable("my_table")

# Query the table from a new SQL cell
%sql
SELECT * FROM my_table LIMIT 10
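
The table can also be read back from Python with spark.table(), and later runs can append to it instead of overwriting it; a short sketch:

# Read the managed table back as a DataFrame
df_table = spark.table("my_table")
display(df_table)

# On subsequent runs, append new rows instead of replacing the table
df.write.format("delta").mode("append").saveAsTable("my_table")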

Collaboration Features

Sharing Notebooks

  1. Open your notebook
  2. Click the "Share" button (top right)
  3. Add users or groups
  4. Set permissions:
    • Can Run: Execute notebook with read-only access
    • Can Edit: Modify and run the notebook
    • Can Manage: Full control including deletion

Comments and Collaboration

  • Click on any cell and press Cmd/Ctrl + Shift + M to add comments
  • Mention team members with @username
  • Resolve conversations when done

Version Control

  1. Click "Revision History" (top right)
  2. View all saved versions
  3. Compare changes between versions
  4. Restore previous versions if needed

Best Practices for Getting Started

1. Cluster Management

# Always terminate clusters when not in use
# Set auto-termination to avoid unnecessary costs
# Use auto-scaling for variable workloads

2. Notebook Organization

- Create folders for different projects
- Use meaningful notebook names
- Add markdown cells for documentation
- Include a README notebook in each folder

3. Code Structure

# Use clear variable names
# Add comments for complex logic
# Separate concerns into different cells
# Use functions for reusable code

def process_data(df, column_name):
    """Process dataframe by filtering and aggregating."""
    return df.filter(df[column_name].isNotNull()).groupBy(column_name).count()

# Usage
result = process_data(df, "age")
display(result)

4. Performance Tips

# Cache DataFrames you'll use multiple times
df_cached = df.cache()

# Use persist for custom storage levels
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

# Partition data appropriately
df.repartition(8).write.format("delta").save("/path/to/data")
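
Cached data occupies cluster memory, so release it when it is no longer needed:

# Release a cached DataFrame once you're done with it
df_cached.unpersist()

# Or drop everything cached in the current session
spark.catalog.clearCache()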

Next Steps

Now that you've set up your workspace and created your first notebook, you can:

  1. Explore Notebooks - Learn advanced notebook features
  2. Dive into Data Engineering - Build data pipelines
  3. Try Machine Learning - Train your first model
  4. Review Best Practices - Production-ready patterns

Troubleshooting

Cluster Won't Start

  • Check your cloud provider quotas
  • Verify IAM permissions
  • Try a different instance type
  • Check the cluster event log for detailed errors

Cannot Access Data

  • Verify storage credentials are configured
  • Check IAM roles and policies
  • Ensure network connectivity (VPC peering, firewall rules)
  • Test with dbutils.fs.ls() commands
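
A quick way to tell credential problems from connectivity problems is to list the path directly and read the error; the bucket path below is a placeholder:

# Placeholder path: replace with the storage location you are trying to reach
path = "s3a://your-bucket/"

try:
    display(dbutils.fs.ls(path))
except Exception as e:
    # Access-denied errors usually point to IAM roles or credentials;
    # timeouts usually point to networking (VPC peering, firewall rules)
    print(f"Could not list {path}: {e}")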

Notebook Errors

  • Verify cluster is attached and running
  • Check library installations
  • Review error messages in cell output
  • Check cluster logs for system errors

Useful Commands

# Display Databricks utilities help
dbutils.help()

# List available commands
dbutils.fs.help()
dbutils.secrets.help()
dbutils.widgets.help()

# Check Spark configuration
spark.sparkContext.getConf().getAll()

# View cluster information
sc = spark.sparkContext
print(f"Master: {sc.master}")
print(f"App Name: {sc.appName}")
print(f"Spark Version: {sc.version}")

Additional Resources