Getting Started with Databricks
This guide will help you set up your Databricks account, create your first workspace, and run your first notebook.
Prerequisites
Before you begin, you'll need:
- A cloud provider account (AWS, Azure, or Google Cloud), required only for the full platform
- A Databricks account (or sign up for Community Edition)
- Basic knowledge of Python, SQL, or Scala
- Understanding of big data concepts (helpful but not required)
Creating a Databricks Account
Option 1: Databricks Community Edition (Free)
The Community Edition is perfect for learning and experimentation:
- Visit https://community.cloud.databricks.com/
- Click "Sign Up"
- Enter your email and create a password
- Verify your email address
- You'll be automatically logged into your workspace
Limitations of Community Edition:
- Single-node clusters only
- Limited compute resources
- Cannot use premium features
- Data is not persistent beyond 30 days of inactivity
Option 2: Full Platform Trial
For production evaluation with all features:
AWS
- Visit https://databricks.com/try-databricks
- Select "AWS" as your cloud provider
- Sign up with your email
- Follow the setup wizard to create your workspace
- Link your AWS account (if required)
Azure
- Visit Azure Portal
- Search for "Azure Databricks"
- Click "Create" to create a new resource
- Fill in workspace details:
  - Subscription
  - Resource Group
  - Workspace Name
  - Region
  - Pricing Tier
- Click "Review + Create"
Google Cloud
- Visit Google Cloud Console
- Enable Databricks from the marketplace
- Follow the setup instructions
- Configure billing and permissions
Understanding the Workspace
Once logged in, you'll see the Databricks workspace interface with several key areas:
Left Sidebar Navigation
- Workspace: Store notebooks, libraries, and folders
- Repos: Git integration for version control
- Data: Browse databases, tables, and files
- Compute: Create and manage clusters
- Workflows: Define and schedule jobs
- Machine Learning: ML experiments and models
Top Navigation Bar
- Search bar for quick access
- Notifications
- User settings
- Help and documentation
Creating Your First Cluster
Clusters are compute resources that execute your code. Here's how to create one:
- Click "Compute" in the left sidebar
- Click "Create Cluster"
- Configure your cluster:
  - Cluster Name: my-first-cluster
  - Cluster Mode: Single Node (for learning) or Standard (for production)
  - Databricks Runtime Version: 13.3 LTS (the latest LTS release is recommended)
  - Node Type:
    - Community Edition: predefined
    - Full Platform: choose based on workload (e.g., m5.large for AWS)
  - Autopilot Options:
    - Enable autoscaling: Yes
    - Min Workers: 2
    - Max Workers: 8
  - Terminate After: 120 minutes of inactivity
- Click "Create Cluster"
- Wait for the cluster to start (indicated by a green status)
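The same configuration can also be created programmatically through the Databricks Clusters REST API. The sketch below is a minimal example; the workspace URL, access token, runtime string, and node type are placeholders you would adjust for your account and cloud.
import requests
# Placeholders: substitute your workspace URL and a personal access token
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
cluster_spec = {
    "cluster_name": "my-first-cluster",
    "spark_version": "13.3.x-scala2.12",   # corresponds to the 13.3 LTS runtime
    "node_type_id": "m5.large",             # AWS example; pick an Azure/GCP type as needed
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 120,
}
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(response.json())  # returns the new cluster_id on success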
Cluster Modes:
- Single Node: For lightweight workloads and development
- Standard: For distributed processing with multiple workers
- High Concurrency: For sharing among multiple users with fine-grained resource allocation
Creating Your First Notebook
Notebooks are interactive documents where you write and execute code:
- Click "Workspace" in the left sidebar
- Navigate to your user folder
- Click the dropdown arrow next to your folder
- Select "Create" > "Notebook"
- Configure your notebook:
  - Name: "My First Notebook"
  - Default Language: Python
  - Cluster: Select your cluster
- Click "Create"
Running Your First Code
Let's write some simple code to verify everything works:
Example 1: Hello World
# In your notebook cell, type:
print("Hello, Databricks!")
# Press Shift + Enter to run the cell
Output:
Hello, Databricks!
Example 2: Create a DataFrame
# Create a simple DataFrame
# (the SparkSession is already available in Databricks notebooks as `spark`)
data = [
    ("Alice", 34, "Data Scientist"),
    ("Bob", 28, "Data Engineer"),
    ("Charlie", 31, "ML Engineer")
]
df = spark.createDataFrame(data, ["name", "age", "role"])
display(df)
This will show a formatted table with your data.
Example 3: SQL Query
# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")
# In a new cell, use the %sql magic command to switch to SQL
%sql
SELECT name, age, role
FROM employees
WHERE age > 30
ORDER BY age DESC
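The same query can also be run from a Python cell with spark.sql, which returns a DataFrame:
# Equivalent query from a Python cell
senior_staff = spark.sql("""
    SELECT name, age, role
    FROM employees
    WHERE age > 30
    ORDER BY age DESC
""")
display(senior_staff)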
Example 4: Read Data from Cloud Storage
# AWS S3 example
df_s3 = spark.read.csv("s3a://your-bucket/data.csv", header=True, inferSchema=True)
# Azure ADLS example
df_adls = spark.read.csv(
    "abfss://container@account.dfs.core.windows.net/data.csv",
    header=True, inferSchema=True
)
# Display the data
display(df_s3.limit(10))
Notebook Features
Magic Commands
Databricks notebooks support magic commands for different languages:
%python # Python code (default)
%sql # SQL queries
%scala # Scala code
%r # R code
%md # Markdown for documentation
%sh # Shell commands
%fs # Databricks File System commands
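The %fs magic is shorthand for the dbutils.fs utilities, so the same file operations are available from plain Python; for example:
# Equivalent of `%fs ls /` in a Python cell
for entry in dbutils.fs.ls("/"):
    print(entry.path, entry.size)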
Visualization
Databricks provides built-in visualizations:
# Create sample data
import pandas as pd
sales_data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'revenue': [10000, 12000, 15000, 13000, 17000, 20000]
})
df_sales = spark.createDataFrame(sales_data)
display(df_sales)
Click the chart icon above the results to create visualizations like:
- Bar charts
- Line charts
- Pie charts
- Scatter plots
- Maps
Widgets for Parameters
Create interactive parameters in your notebook:
# Create a text widget
dbutils.widgets.text("user_name", "Guest", "Enter your name")
# Use the widget value
name = dbutils.widgets.get("user_name")
print(f"Hello, {name}!")
# Create a dropdown widget
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "prod"])
env = dbutils.widgets.get("environment")
print(f"Running in {env} environment")
Working with Data
Databricks File System (DBFS)
DBFS is a distributed file system that comes with Databricks:
# List files in DBFS
%fs ls /
# Upload a file through UI: Data > DBFS > Upload File
# Read a file from DBFS
df = spark.read.csv("/FileStore/tables/mydata.csv", header=True, inferSchema=True)
display(df)
Creating Tables
# Create a managed table
df.write.format("delta").mode("overwrite").saveAsTable("my_table")
# Query the table
%sql
SELECT * FROM my_table LIMIT 10
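The managed table can also be loaded back into a DataFrame from Python:
# Read the managed table back into a DataFrame
df_back = spark.table("my_table")
display(df_back)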
Collaboration Features
Sharing Notebooks
- Open your notebook
- Click the "Share" button (top right)
- Add users or groups
- Set permissions:
- Can Run: Execute notebook with read-only access
- Can Edit: Modify and run the notebook
- Can Manage: Full control including deletion
Comments and Collaboration
- Click on any cell and press Cmd/Ctrl + Shift + M to add comments
- Mention team members with @username
- Resolve conversations when done
Version Control
- Click "Revision History" (top right)
- View all saved versions
- Compare changes between versions
- Restore previous versions if needed
Best Practices for Getting Started
1. Cluster Management
- Always terminate clusters when not in use
- Set auto-termination to avoid unnecessary costs
- Use auto-scaling for variable workloads
2. Notebook Organization
- Create folders for different projects
- Use meaningful notebook names
- Add markdown cells for documentation
- Include a README notebook in each folder
3. Code Structure
# Use clear variable names
# Add comments for complex logic
# Separate concerns into different cells
# Use functions for reusable code
def process_data(df, column_name):
    """Drop rows where column_name is null, then count rows per value."""
    return df.filter(df[column_name].isNotNull()).groupBy(column_name).count()
# Usage
result = process_data(df, "age")
display(result)
4. Performance Tips
# Cache DataFrames you'll use multiple times
df_cached = df.cache()
# Use persist for custom storage levels
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
# Partition data appropriately
df.repartition(8).write.format("delta").save("/path/to/data")
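Cached data occupies cluster memory until the cluster restarts, so release it once it is no longer needed (using the df_cached variable from above):
# Free cached/persisted data when you are done with it
df_cached.unpersist()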
Next Steps
Now that you've set up your workspace and created your first notebook, you can:
- Explore Notebooks: learn advanced notebook features
- Dive into Data Engineering: build data pipelines
- Try Machine Learning: train your first model
- Review Best Practices: production-ready patterns
Troubleshooting
Cluster Won't Start
- Check your cloud provider quotas
- Verify IAM permissions
- Try a different instance type
- Check the cluster event log for detailed errors
Cannot Access Data
- Verify storage credentials are configured
- Check IAM roles and policies
- Ensure network connectivity (VPC peering, firewall rules)
- Test access with a dbutils.fs.ls() call, as in the sketch below
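A minimal check that the cluster can reach your storage (the bucket path is a placeholder):
# Quick storage connectivity check; replace the path with your own bucket or container
try:
    for entry in dbutils.fs.ls("s3a://your-bucket/"):
        print(entry.path)
except Exception as e:
    print(f"Access failed: {e}")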
Notebook Errors
- Verify cluster is attached and running
- Check library installations
- Review error messages in cell output
- Check cluster logs for system errors
Useful Commands
# Display Databricks utilities help
dbutils.help()
# List available commands
dbutils.fs.help()
dbutils.secrets.help()
dbutils.widgets.help()
# Check Spark configuration
spark.sparkContext.getConf().getAll()
# View cluster information
sc = spark.sparkContext
print(f"Master: {sc.master}")
print(f"App Name: {sc.appName}")
print(f"Spark Version: {sc.version}")