Working with Notebooks
Notebooks are the primary interface for interactive data analysis and development in Databricks. They combine code, visualizations, and narrative text in a single document.
Notebook Basics
Creating and Managing Notebooks
Create a New Notebook:
Workspace → Your Folder → Create → Notebook
Import Existing Notebooks:
- From local files (.ipynb, .dbc, .html)
- From Git repositories
- From other Databricks workspaces
Export Notebooks:
# Via UI: File → Export → Select format
# Formats: Source File, HTML, DBC Archive, IPython Notebook
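Beyond the UI, notebooks can also be exported and imported programmatically through the Workspace REST API, which is handy for backups and CI. The sketch below is a minimal example, not the only approach; the host, token, output file name, and notebook path are placeholders you would replace.

```python
import base64
import os
import requests

# Placeholder credentials -- set these for your workspace
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

# Export a notebook as source code (formats: SOURCE, HTML, JUPYTER, DBC)
resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users/me@example.com/my_notebook", "format": "SOURCE"},
)
resp.raise_for_status()

# The API returns the notebook content base64-encoded
with open("my_notebook.py", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))
```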
Notebook Languages
Databricks notebooks support multiple languages:
# Python (default)
df = spark.read.csv("/data/sample.csv")
display(df)
-- SQL
%sql
SELECT * FROM my_table
WHERE date >= '2024-01-01'
LIMIT 100
// Scala
%scala
val df = spark.read.format("delta").load("/data/delta-table")
df.show()
# R
%r
library(SparkR)
df <- read.df("/data/sample.parquet", "parquet")
head(df)
Magic Commands
Magic commands change the execution context:
%python # Switch to Python
%sql # Switch to SQL
%scala # Switch to Scala
%r # Switch to R
%md # Markdown for documentation
%sh # Shell commands
%fs # Databricks file system operations
%pip # Install Python packages
%run # Run another notebook
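A few of these in practice. Each magic command must be the first line of its own cell; the dataset path, package name, and helper notebook below are placeholders:

```
%fs ls /databricks-datasets

%pip install requests

%sh ls /tmp

%run ./shared/helper_functions
```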
Advanced Notebook Features
Cell Operations
Keyboard Shortcuts:
Shift + Enter # Run cell and move to next
Ctrl/Cmd + Enter # Run cell and stay
Esc + A # Insert cell above
Esc + B # Insert cell below
Esc + D D # Delete cell
Esc + M # Convert to markdown
Esc + Y # Convert to code
Ctrl/Cmd + / # Comment/uncomment lines
Cell Actions:
# Run all cells
# Notebook menu → Run All
# Clear all results
# Notebook menu → Clear → All Results
# Collapse cells for better organization
# Click arrow icon next to cell number
Markdown and Documentation
Create rich documentation with markdown:
%md
# Main Title
## Section Heading
### Subsection
**Bold text** and *italic text*
- Bullet point 1
- Bullet point 2
  - Nested bullet
1. Numbered list
2. Second item
`inline code`
```python
# Code block
def hello():
    print("Hello, World!")
```

| Column 1 | Column 2 | Column 3 |
|---|---|---|
| Data 1 | Data 2 | Data 3 |

> Blockquote for important notes

Horizontal rule:

---
### Notebook Widgets
Widgets create interactive parameters for notebooks:
#### Text Widget
```python
# Create a text input widget
dbutils.widgets.text("database_name", "default", "Database")
# Get the widget value
db_name = dbutils.widgets.get("database_name")
print(f"Using database: {db_name}")
# Use in SQL
spark.sql(f"USE {db_name}")
```

#### Dropdown Widget
# Create a dropdown widget
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "prod"], "Environment")
env = dbutils.widgets.get("environment")
# Use in conditional logic
if env == "prod":
    table_name = "production.sales"
else:
    table_name = f"{env}.sales"
print(f"Using table: {table_name}")
#### Multiselect Widget
# Create a multiselect widget
dbutils.widgets.multiselect("regions", "US", ["US", "EU", "APAC", "LATAM"], "Regions")
# Get selected values (comma-separated string)
selected_regions = dbutils.widgets.get("regions").split(",")
print(f"Selected regions: {selected_regions}")
# Use in query
region_filter = "', '".join(selected_regions)
query = f"""
SELECT * FROM sales
WHERE region IN ('{region_filter}')
"""
#### Combobox Widget
# Create a combobox widget (dropdown with custom input)
dbutils.widgets.combobox("limit", "100", ["10", "100", "1000", "10000"], "Row Limit")
limit_value = int(dbutils.widgets.get("limit"))
df = spark.table("my_table").limit(limit_value)
display(df)
#### Remove Widgets
# Remove a specific widget
dbutils.widgets.remove("database_name")
# Remove all widgets
dbutils.widgets.removeAll()
Notebook Workflows
Running Other Notebooks
# Run another notebook and get its return value
result = dbutils.notebook.run(
    "/path/to/notebook",
    600,  # timeout in seconds
    {"param1": "value1", "param2": "value2"}
)
print(f"Notebook returned: {result}")
In the called notebook:
# Get parameters passed from parent notebook
param1 = dbutils.widgets.get("param1")
param2 = dbutils.widgets.get("param2")
# Do some processing
result = f"Processed {param1} and {param2}"
# Return value to parent notebook
dbutils.notebook.exit(result)
Notebook Pipeline Pattern
# Master notebook orchestrating a pipeline
def run_pipeline():
    # Step 1: Data ingestion
    dbutils.notebook.run("/pipelines/01_ingest_data", 600, {
        "source_path": "/data/raw/",
        "target_table": "bronze.raw_data"
    })

    # Step 2: Data transformation
    dbutils.notebook.run("/pipelines/02_transform_data", 1200, {
        "source_table": "bronze.raw_data",
        "target_table": "silver.clean_data"
    })

    # Step 3: Aggregation
    result = dbutils.notebook.run("/pipelines/03_aggregate_data", 900, {
        "source_table": "silver.clean_data",
        "target_table": "gold.aggregated_data"
    })

    print(f"Pipeline completed: {result}")

run_pipeline()
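dbutils.notebook.run raises an exception when the child notebook fails or exceeds its timeout, so production pipelines usually wrap each step in a retry helper. A minimal sketch (the retry count, wait time, and step path are illustrative):

```python
import time

def run_with_retry(path, timeout_seconds, arguments, max_retries=2):
    """Run a child notebook, retrying on failure (illustrative helper)."""
    for attempt in range(max_retries + 1):
        try:
            return dbutils.notebook.run(path, timeout_seconds, arguments)
        except Exception as e:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt + 1} failed for {path}: {e}. Retrying...")
            time.sleep(30)

result = run_with_retry("/pipelines/01_ingest_data", 600, {"source_path": "/data/raw/"})
```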
Data Visualization
Built-in Visualizations
Databricks provides rich visualization options:
# Create sample data
import pandas as pd
# Sales data
data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'revenue': [45000, 52000, 48000, 61000, 58000, 67000],
    'expenses': [35000, 38000, 36000, 42000, 40000, 45000],
    'customers': [120, 145, 132, 178, 165, 198]
})
df = spark.createDataFrame(data)
display(df)
Available Chart Types:
- Bar Chart: Compare categories
- Line Chart: Show trends over time
- Pie Chart: Show proportions
- Scatter Plot: Show correlations
- Map: Geographic data visualization
- Histogram: Distribution analysis
- Box Plot: Statistical distribution
Customizing Visualizations:
- Click the chart icon above results
- Select chart type
- Configure:
- Keys (X-axis)
- Values (Y-axis)
- Series groupings
- Aggregation functions
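The chart editor works best when the key and value you want to plot already exist as columns, so it often helps to aggregate before calling display(). For example, with the sample DataFrame above:

```python
from pyspark.sql import functions as F

# Pre-aggregate so the chart's key (month) and value (total_revenue) are explicit columns
monthly = df.groupBy("month").agg(F.sum("revenue").alias("total_revenue"))
display(monthly)  # then choose Bar chart with month as the key and total_revenue as the value
```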
Advanced Visualizations
Using Plotly
import plotly.express as px
import pandas as pd
# Create data
df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D', 'E'],
    'value': [23, 45, 56, 78, 32]
})
# Create interactive plot
fig = px.bar(df, x='category', y='value', title='Sales by Category')
fig.show()
Using Matplotlib
import matplotlib.pyplot as plt
import pandas as pd
# Create data
df = pd.DataFrame({
    'x': range(10),
    'y': [i**2 for i in range(10)]
})
# Create plot
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'], marker='o')
plt.title('Quadratic Growth')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.grid(True)
plt.show()
Using Seaborn
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create sample data
data = pd.DataFrame({
    'x': [i for i in range(100)],
    'y': [i + np.random.randint(-10, 10) for i in range(100)],
    'category': ['A' if i < 50 else 'B' for i in range(100)]
})
# Create visualization
plt.figure(figsize=(12, 6))
sns.scatterplot(data=data, x='x', y='y', hue='category')
plt.title('Scatter Plot with Categories')
plt.show()
Notebook Collaboration
Version Control
Using Built-in Revision History:
# Access: Click "Revision History" in notebook toolbar
# Features:
# - View all saved versions
# - Compare versions side-by-side
# - Restore previous versions
# - See who made changes and when
Using Git Integration:
- Link to Git Repository:
Workspace → Repos → Add Repo → Enter Git URL
- Clone a Repository:
Repos → Create Repo → Git URL → Create
- Commit Changes:
# Via UI: Repos → Select your repo → Commit changes
# Add commit message and push
- Pull Latest Changes:
Repos → Your repo → Pull
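The same Repos operations can be scripted through the Repos REST API, which is useful in CI/CD. The sketch below is a minimal example; the host, token, Git URL, and workspace path are placeholders.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Clone a Git repository into the workspace (URL and path are placeholders)
resp = requests.post(
    f"{host}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://github.com/my-org/my-project.git",
        "provider": "gitHub",
        "path": "/Repos/me@example.com/my-project",
    },
)
resp.raise_for_status()
repo_id = resp.json()["id"]

# "Pull" the latest changes by updating the repo to the head of a branch
requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "main"},
).raise_for_status()
```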
Sharing and Permissions
# Share notebook via UI
# 1. Click "Share" button (top right)
# 2. Add users or groups
# 3. Set permissions:
# - Can Run: Execute with read-only access
# - Can Edit: Modify and run
# - Can Manage: Full control
# Schedule notebook as job
# 1. Click schedule icon
# 2. Configure job settings
# 3. Set schedule (cron expression or UI picker)
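Scheduling can also be done through the Jobs API instead of the UI. A hedged sketch of creating a nightly run of a notebook (the cluster ID, notebook path, and cron expression are placeholders):

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "nightly-notebook-run",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Users/me@example.com/my_notebook"},
            "existing_cluster_id": "1234-567890-abcde123",  # placeholder cluster ID
        }
    ],
    # Quartz cron: run every day at 02:00 UTC
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print(f"Created job {resp.json()['job_id']}")
```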
Comments and Discussions
# Add comments to cells
# 1. Select cell
# 2. Press Cmd/Ctrl + Shift + M
# 3. Type comment
# 4. Mention users with @username
# 5. Mark as resolved when addressed
Performance Optimization
Caching DataFrames
# Cache DataFrame for reuse
df_cached = df.cache()
# Use the cached DataFrame multiple times
df_cached.filter(df_cached.age > 30).count()
df_cached.groupBy("department").count().show()
# Unpersist when no longer needed
df_cached.unpersist()
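cache() uses the default storage level (memory, spilling to disk for DataFrames). When executor memory is tight, persist() lets you choose the storage level explicitly; a small sketch:

```python
from pyspark import StorageLevel

# Keep the data only on disk instead of competing for executor memory
df_persisted = df.persist(StorageLevel.DISK_ONLY)

df_persisted.count()    # an action materializes the persisted data
df_persisted.unpersist()
```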
Lazy vs Eager Evaluation
# Transformations are lazy (not executed immediately)
df_filtered = df.filter(df.age > 30)                    # Not executed yet
df_counts = df_filtered.groupBy("department").count()   # Still not executed (also a transformation)
# Actions trigger execution
results = df_counts.collect()                           # Now everything executes
# Use explain to see the execution plan
df_counts.explain(mode="formatted")
Avoiding Shuffles
# Shuffle operations are expensive
# repartition() forces a full shuffle -- use it only when you need more partitions
df_repartitioned = df.repartition(100)
# coalesce() reduces the partition count without a full shuffle
df_coalesced = df.coalesce(10)
# Broadcast small tables in joins to avoid shuffling the large side
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
Best Practices
1. Notebook Organization
# Use markdown cells for structure
%md
# Notebook Title
**Purpose**: Describe what this notebook does
**Author**: Your name
**Last Updated**: Date
## Section 1: Data Loading
## Section 2: Data Processing
## Section 3: Results
2. Modular Code
# Define reusable functions
from pyspark.sql import functions as F

def clean_data(df, columns_to_drop):
    """Remove specified columns and null values."""
    df_clean = df.drop(*columns_to_drop)
    df_clean = df_clean.dropna()
    return df_clean

def calculate_metrics(df, group_col, value_col):
    """Calculate aggregated metrics."""
    return df.groupBy(group_col).agg(
        F.sum(value_col).alias("total"),
        F.avg(value_col).alias("average"),
        F.count(value_col).alias("count")
    )
# Use the functions
df_clean = clean_data(df, ["temp_col1", "temp_col2"])
metrics = calculate_metrics(df_clean, "category", "amount")
display(metrics)
3. Error Handling
# Use try-except blocks
try:
    df = spark.read.csv("/path/to/data.csv", header=True)
    display(df)
except Exception as e:
    print(f"Error reading data: {str(e)}")
    # Fallback or alternative logic (assumes a `schema` object is already defined)
    df = spark.createDataFrame([], schema)
4. Logging and Debugging
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Use logging
logger.info("Starting data processing")
logger.debug(f"DataFrame has {df.count()} rows")
logger.warning("Missing values detected")
logger.error("Failed to process data")
# Debug with display
display(df.describe())
df.printSchema()  # printSchema() prints directly and returns None, so display() is not needed
5. Testing
# Unit test pattern in notebooks
def test_data_quality(df):
    """Test data quality checks."""
    assert df.count() > 0, "DataFrame is empty"
    assert df.filter(df.id.isNull()).count() == 0, "Null IDs found"
    print("All quality checks passed")
# Run tests
test_data_quality(df)
Troubleshooting
Common Issues
Cell Won't Execute:
# Check cluster status
# Verify cluster is running and attached
# Check for syntax errors
# Review notebook state (restart if needed)
Out of Memory Errors:
# Increase driver/executor memory in cluster config
# Use more partitions
df = df.repartition(200)
# Process data in batches
# Use incremental processing
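One way to process data in batches is to loop over a partitioning column, such as a date, and handle one slice at a time. A minimal sketch, assuming the DataFrame has an event_date column and writing to an illustrative Delta path:

```python
from pyspark.sql import functions as F

# Collect the distinct slice values (assumes an `event_date` column exists)
dates = [row["event_date"] for row in df.select("event_date").distinct().collect()]

for d in dates:
    batch = df.filter(F.col("event_date") == d)
    # Process and write one slice at a time to keep memory pressure low
    batch.write.mode("append").format("delta").save("/data/output/incremental")
```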
Slow Performance:
# Check execution plan
df.explain(mode="formatted")
# Add caching strategically
# Optimize joins and shuffles
# Use appropriate file formats (Parquet, Delta)
Next Steps
- Learn about Data Engineering pipelines
- Explore Machine Learning in notebooks
- Review Best Practices for production notebooks