Working with Notebooks
Notebooks are the primary interface for interactive data analysis and development in Databricks. They combine code, visualizations, and narrative text in a single document.
Notebook Basics
Creating and Managing Notebooks
Create a New Notebook:
Workspace → Your Folder → Create → Notebook
Import Existing Notebooks:
- From local files (.ipynb, .dbc, .html)
- From Git repositories
- From other Databricks workspaces
Export Notebooks:
# Via UI: File → Export → Select format
# Formats: Source File, HTML, DBC Archive, IPython Notebook
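Beyond the UI, notebooks can also be exported and imported programmatically through the Workspace REST API, which is handy for backups and CI. The sketch below is a minimal example, not the only approach; the host, token, output file name, and notebook path are placeholders you would replace.

```python
import base64
import os
import requests

# Placeholder credentials -- set these for your workspace
host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

# Export a notebook as source code (formats: SOURCE, HTML, JUPYTER, DBC)
resp = requests.get(
    f"{host}/api/2.0/workspace/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"path": "/Users/me@example.com/my_notebook", "format": "SOURCE"},
)
resp.raise_for_status()

# The API returns the notebook content base64-encoded
with open("my_notebook.py", "wb") as f:
    f.write(base64.b64decode(resp.json()["content"]))
```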
Notebook Languages
Databricks notebooks support multiple languages:
# Python (default)
df = spark.read.csv("/data/sample.csv")
display(df)
-- SQL
%sql
SELECT * FROM my_table
WHERE date >= '2024-01-01'
LIMIT 100
// Scala
%scala
val df = spark.read.format("delta").load("/data/delta-table")
df.show()
# R
%r
library(SparkR)
df <- read.df("/data/sample.parquet", "parquet")
head(df)
Magic Commands
Magic commands change the execution context:
%python # Switch to Python
%sql # Switch to SQL
%scala # Switch to Scala
%r # Switch to R
%md # Markdown for documentation
%sh # Shell commands
%fs # Databricks file system operations
%pip # Install Python packages
%run # Run another notebook
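A few of these in practice. Each magic command must be the first line of its own cell; the dataset path, package name, and helper notebook below are placeholders:

```
%fs ls /databricks-datasets

%pip install requests

%sh ls /tmp

%run ./shared/helper_functions
```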
Advanced Notebook Features
Cell Operations
Keyboard Shortcuts:
Shift + Enter # Run cell and move to next
Ctrl/Cmd + Enter # Run cell and stay
Esc + A # Insert cell above
Esc + B # Insert cell below
Esc + D D # Delete cell
Esc + M # Convert to markdown
Esc + Y # Convert to code
Ctrl/Cmd + / # Comment/uncomment lines
Cell Actions:
# Run all cells
# Notebook menu → Run All
# Clear all results
# Notebook menu → Clear → All Results
# Collapse cells for better organization
# Click arrow icon next to cell number
Markdown and Documentation
Create rich documentation with markdown:
%md
# Main Title
## Section Heading
### Subsection
**Bold text** and *italic text*
- Bullet point 1
- Bullet point 2
  - Nested bullet
1. Numbered list
2. Second item
`inline code`
```python
# Code block
def hello():
    print("Hello, World!")
```

| Column 1 | Column 2 | Column 3 |
|---|---|---|
| Data 1 | Data 2 | Data 3 |

> Blockquote for important notes

Horizontal rule:

---
### Notebook Widgets
Widgets create interactive parameters for notebooks:
#### Text Widget
```python
# Create a text input widget
dbutils.widgets.text("database_name", "default", "Database")
# Get the widget value
db_name = dbutils.widgets.get("database_name")
print(f"Using database: {db_name}")
# Use in SQL
spark.sql(f"USE {db_name}")
```

#### Dropdown Widget
# Create a dropdown widget
dbutils.widgets.dropdown("environment", "dev", ["dev", "staging", "prod"], "Environment")
env = dbutils.widgets.get("environment")
# Use in conditional logic
if env == "prod":
    table_name = "production.sales"
else:
    table_name = f"{env}.sales"
print(f"Using table: {table_name}")
#### Multiselect Widget
# Create a multiselect widget
dbutils.widgets.multiselect("regions", "US", ["US", "EU", "APAC", "LATAM"], "Regions")
# Get selected values (comma-separated string)
selected_regions = dbutils.widgets.get("regions").split(",")
print(f"Selected regions: {selected_regions}")
# Use in query
region_filter = "', '".join(selected_regions)
query = f"""
SELECT * FROM sales
WHERE region IN ('{region_filter}')
"""
#### Combobox Widget
# Create a combobox widget (dropdown with custom input)
dbutils.widgets.combobox("limit", "100", ["10", "100", "1000", "10000"], "Row Limit")
limit_value = int(dbutils.widgets.get("limit"))
df = spark.table("my_table").limit(limit_value)
display(df)
#### Remove Widgets
# Remove a specific widget
dbutils.widgets.remove("database_name")
# Remove all widgets
dbutils.widgets.removeAll()
Notebook Workflows
Running Other Notebooks
# Run another notebook and get its return value
result = dbutils.notebook.run(
    "/path/to/notebook",
    600,  # timeout in seconds
    {"param1": "value1", "param2": "value2"}
)
print(f"Notebook returned: {result}")
In the called notebook:
# Get parameters passed from parent notebook
param1 = dbutils.widgets.get("param1")
param2 = dbutils.widgets.get("param2")
# Do some processing
result = f"Processed {param1} and {param2}"
# Return value to parent notebook
dbutils.notebook.exit(result)
Notebook Pipeline Pattern
# Master notebook orchestrating a pipeline
def run_pipeline():
    # Step 1: Data ingestion
    dbutils.notebook.run("/pipelines/01_ingest_data", 600, {
        "source_path": "/data/raw/",
        "target_table": "bronze.raw_data"
    })

    # Step 2: Data transformation
    dbutils.notebook.run("/pipelines/02_transform_data", 1200, {
        "source_table": "bronze.raw_data",
        "target_table": "silver.clean_data"
    })

    # Step 3: Aggregation
    result = dbutils.notebook.run("/pipelines/03_aggregate_data", 900, {
        "source_table": "silver.clean_data",
        "target_table": "gold.aggregated_data"
    })

    print(f"Pipeline completed: {result}")

run_pipeline()
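dbutils.notebook.run raises an exception when the child notebook fails or exceeds its timeout, so production pipelines usually wrap each step in a retry helper. A minimal sketch (the retry count, wait time, and step path are illustrative):

```python
import time

def run_with_retry(path, timeout_seconds, arguments, max_retries=2):
    """Run a child notebook, retrying on failure (illustrative helper)."""
    for attempt in range(max_retries + 1):
        try:
            return dbutils.notebook.run(path, timeout_seconds, arguments)
        except Exception as e:
            if attempt == max_retries:
                raise
            print(f"Attempt {attempt + 1} failed for {path}: {e}. Retrying...")
            time.sleep(30)

result = run_with_retry("/pipelines/01_ingest_data", 600, {"source_path": "/data/raw/"})
```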
Data Visualization
Built-in Visualizations
Databricks provides rich visualization options:
# Create sample data
import pandas as pd
# Sales data
data = pd.DataFrame({
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'revenue': [45000, 52000, 48000, 61000, 58000, 67000],
    'expenses': [35000, 38000, 36000, 42000, 40000, 45000],
    'customers': [120, 145, 132, 178, 165, 198]
})
df = spark.createDataFrame(data)
display(df)
Available Chart Types:
- Bar Chart: Compare categories
- Line Chart: Show trends over time
- Pie Chart: Show proportions
- Scatter Plot: Show correlations
- Map: Geographic data visualization
- Histogram: Distribution analysis
- Box Plot: Statistical distribution
Customizing Visualizations:
- Click the chart icon above results
- Select chart type
- Configure:
- Keys (X-axis)
- Values (Y-axis)
- Series groupings
- Aggregation functions
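The chart editor works best when the key and value you want to plot already exist as columns, so it often helps to aggregate before calling display(). For example, with the sample DataFrame above:

```python
from pyspark.sql import functions as F

# Pre-aggregate so the chart's key (month) and value (total_revenue) are explicit columns
monthly = df.groupBy("month").agg(F.sum("revenue").alias("total_revenue"))
display(monthly)  # then choose Bar chart with month as the key and total_revenue as the value
```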
Advanced Visualizations
Using Plotly
import plotly.express as px
import pandas as pd
# Create data
df = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D', 'E'],
    'value': [23, 45, 56, 78, 32]
})
# Create interactive plot
fig = px.bar(df, x='category', y='value', title='Sales by Category')
fig.show()
Using Matplotlib
import matplotlib.pyplot as plt
import pandas as pd
# Create data
df = pd.DataFrame({
    'x': range(10),
    'y': [i**2 for i in range(10)]
})
# Create plot
plt.figure(figsize=(10, 6))
plt.plot(df['x'], df['y'], marker='o')
plt.title('Quadratic Growth')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.grid(True)
plt.show()
Using Seaborn
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create sample data
data = pd.DataFrame({
    'x': [i for i in range(100)],
    'y': [i + np.random.randint(-10, 10) for i in range(100)],
    'category': ['A' if i < 50 else 'B' for i in range(100)]
})
# Create visualization
plt.figure(figsize=(12, 6))
sns.scatterplot(data=data, x='x', y='y', hue='category')
plt.title('Scatter Plot with Categories')
plt.show()
Notebook Collaboration
Version Control
Using Built-in Revision History:
# Access: Click "Revision History" in notebook toolbar
# Features:
# - View all saved versions
# - Compare versions side-by-side
# - Restore previous versions
# - See who made changes and when
Using Git Integration:
- Link to Git Repository:
Workspace → Repos → Add Repo → Enter Git URL
- Clone a Repository:
Repos → Create Repo → Git URL → Create
- Commit Changes:
# Via UI: Repos → Select your repo → Commit changes
# Add commit message and push
- Pull Latest Changes:
Repos → Your repo → Pull
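The same Repos operations can be scripted through the Repos REST API, which is useful in CI/CD. The sketch below is a minimal example; the host, token, Git URL, and workspace path are placeholders.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

# Clone a Git repository into the workspace (URL and path are placeholders)
resp = requests.post(
    f"{host}/api/2.0/repos",
    headers=headers,
    json={
        "url": "https://github.com/my-org/my-project.git",
        "provider": "gitHub",
        "path": "/Repos/me@example.com/my-project",
    },
)
resp.raise_for_status()
repo_id = resp.json()["id"]

# "Pull" the latest changes by updating the repo to the head of a branch
requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers=headers,
    json={"branch": "main"},
).raise_for_status()
```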
Sharing and Permissions
# Share notebook via UI
# 1. Click "Share" button (top right)
# 2. Add users or groups
# 3. Set permissions:
# - Can Run: Execute with read-only access
# - Can Edit: Modify and run
# - Can Manage: Full control
# Schedule notebook as job
# 1. Click schedule icon
# 2. Configure job settings
# 3. Set schedule (cron expression or UI picker)
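Scheduling can also be done through the Jobs API instead of the UI. A hedged sketch of creating a nightly run of a notebook (the cluster ID, notebook path, and cron expression are placeholders):

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

job_spec = {
    "name": "nightly-notebook-run",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Users/me@example.com/my_notebook"},
            "existing_cluster_id": "1234-567890-abcde123",  # placeholder cluster ID
        }
    ],
    # Quartz cron: run every day at 02:00 UTC
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(f"{host}/api/2.1/jobs/create", headers=headers, json=job_spec)
resp.raise_for_status()
print(f"Created job {resp.json()['job_id']}")
```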
Comments and Discussions
# Add comments to cells
# 1. Select cell
# 2. Press Cmd/Ctrl + Shift + M
# 3. Type comment
# 4. Mention users with @username
# 5. Mark as resolved when addressed
Performance Optimization
Caching DataFrames
# Cache DataFrame for reuse
df_cached = df.cache()
# Use the cached DataFrame multiple times
df_cached.filter(df_cached.age > 30).count()
df_cached.groupBy("department").count().show()
# Unpersist when no longer needed
df_cached.unpersist()
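cache() uses the default storage level (memory, spilling to disk for DataFrames). When executor memory is tight, persist() lets you choose the storage level explicitly; a small sketch:

```python
from pyspark import StorageLevel

# Keep the data only on disk instead of competing for executor memory
df_persisted = df.persist(StorageLevel.DISK_ONLY)

df_persisted.count()    # an action materializes the persisted data
df_persisted.unpersist()
```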
Lazy vs Eager Evaluation
# Transformations are lazy (not executed immediately)
df_filtered = df.filter(df.age > 30)                    # Not executed yet
df_counts = df_filtered.groupBy("department").count()   # Still not executed (also a transformation)
# Actions trigger execution
results = df_counts.collect()                           # Now everything executes
# Use explain to see the execution plan
df_counts.explain(mode="formatted")
Avoiding Shuffles
# Shuffle operations are expensive
# repartition() forces a full shuffle -- use it only when you need more partitions
df_repartitioned = df.repartition(100)
# coalesce() reduces the partition count without a full shuffle
df_coalesced = df.coalesce(10)
# Broadcast small tables in joins to avoid shuffling the large side
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
Best Practices
1. Notebook Organization
# Use markdown cells for structure
%md
# Notebook Title
**Purpose**: Describe what this notebook does
**Author**: Your name
**Last Updated**: Date
## Section 1: Data Loading
## Section 2: Data Processing
## Section 3: Results
2. Modular Code
# Define reusable functions
from pyspark.sql import functions as F

def clean_data(df, columns_to_drop):
    """Remove specified columns and null values."""
    df_clean = df.drop(*columns_to_drop)
    df_clean = df_clean.dropna()
    return df_clean

def calculate_metrics(df, group_col, value_col):
    """Calculate aggregated metrics."""
    return df.groupBy(group_col).agg(
        F.sum(value_col).alias("total"),
        F.avg(value_col).alias("average"),
        F.count(value_col).alias("count")
    )
# Use the functions
df_clean = clean_data(df, ["temp_col1", "temp_col2"])
metrics = calculate_metrics(df_clean, "category", "amount")
display(metrics)
3. Error Handling
# Use try-except blocks
try:
    df = spark.read.csv("/path/to/data.csv", header=True)
    display(df)
except Exception as e:
    print(f"Error reading data: {str(e)}")
    # Fallback or alternative logic (assumes a `schema` object is already defined)
    df = spark.createDataFrame([], schema)
4. Logging and Debugging
import logging
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
# Use logging
logger.info("Starting data processing")
logger.debug(f"DataFrame has {df.count()} rows")
logger.warning("Missing values detected")
logger.error("Failed to process data")
# Debug with display
display(df.describe())
df.printSchema()  # printSchema() prints directly and returns None, so display() is not needed
5. Testing
# Unit test pattern in notebooks
def test_data_quality(df):
    """Test data quality checks."""
    assert df.count() > 0, "DataFrame is empty"
    assert df.filter(df.id.isNull()).count() == 0, "Null IDs found"
    print("All quality checks passed")
# Run tests
test_data_quality(df)
Troubleshooting
Common Issues
Cell Won't Execute:
# Check cluster status
# Verify cluster is running and attached
# Check for syntax errors
# Review notebook state (restart if needed)
Out of Memory Errors:
# Increase driver/executor memory in cluster config
# Use more partitions
df = df.repartition(200)
# Process data in batches
# Use incremental processing
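One way to process data in batches is to loop over a partitioning column, such as a date, and handle one slice at a time. A minimal sketch, assuming the DataFrame has an event_date column and writing to an illustrative Delta path:

```python
from pyspark.sql import functions as F

# Collect the distinct slice values (assumes an `event_date` column exists)
dates = [row["event_date"] for row in df.select("event_date").distinct().collect()]

for d in dates:
    batch = df.filter(F.col("event_date") == d)
    # Process and write one slice at a time to keep memory pressure low
    batch.write.mode("append").format("delta").save("/data/output/incremental")
```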
Slow Performance:
# Check execution plan
df.explain(mode="formatted")
# Add caching strategically
# Optimize joins and shuffles
# Use appropriate file formats (Parquet, Delta)
Next Steps
- Learn about Data Engineering pipelines
- Explore Machine Learning in notebooks
- Review Best Practices for production notebooks