How We Reduced CRM Data Latency by 96%

Our sales and marketing teams were working with 24-hour-old data. A lead would sign up at 9 AM, but their engagement data and derived insights wouldn't be available until 9 AM the next day (sometimes later, or not at all). By then, the lead might have already moved on to a competitor or been targeted with the wrong messaging.

We needed real-time CRM data. HubSpot offered a native reverse-ETL sync from Snowflake that looked perfect on paper. In practice, it broke immediately.

The problem? Cybrary uses UUIDs as the primary user identifier, while HubSpot keys contacts on email addresses. When a user changed their email or used multiple addresses, the native sync's change detection broke down: updates were missed and hard to detect. It was fundamentally incompatible with our data model.

We had two options: pay $450/month for a reverse-ETL service like Hightouch, or build our own. We chose to build. It took a few days and costs ~$50/month to run.

Here's how we did it, and why Snowflake's data share made it feasible.

Build vs Buy: The Decision

When the native sync failed, we evaluated our options:

Option A: Pay for Hightouch (or Census)

Option B: Build Custom

The decision hinged on one question: Could we actually build this quickly?

The answer was yes, thanks to HubSpot's Snowflake data share.

The Key Enabler: Snowflake Data Share

HubSpot provides a Snowflake data share that lets you query their CRM data directly from your Snowflake warehouse. This is a game-changer for building reverse-ETL.

Why it matters:

  • HubSpot's current CRM state is queryable directly in SQL, with no extraction pipeline to build first
  • Delta detection can run in Snowflake, right next to our warehouse data, instead of in application code

This meant we could offload the hardest part—finding which records need updating—to Snowflake. The custom code just needed to fetch results and submit them to HubSpot's API.
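
For context, consuming a Snowflake data share is a one-time setup on the consumer side. A minimal sketch of what that looks like, with placeholder account and share names rather than HubSpot's actual identifiers:

-- Mount the provider's share as a local database (names here are illustrative)
CREATE DATABASE hubspot_share FROM SHARE provider_account.hubspot_crm_share;

-- From then on, the shared CRM tables are queryable like any local table
SELECT contact_id, email
FROM hubspot_share.contacts
LIMIT 10;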

What We Built

The architecture is simple: Snowflake finds the deltas, Cloud Run submits them to HubSpot.

Architecture:
─────────────

┌──────────────────────────────────────────────────────────┐
│                    Snowflake                              │
│  ┌─────────────────┐         ┌─────────────────┐        │
│  │  Our Warehouse  │         │  HubSpot Share  │        │
│  │  (user data)    │────────▶│  (current CRM)  │        │
│  │                 │  compare │                 │        │
│  └─────────────────┘         └─────────────────┘        │
│         │                                                 │
│         │    SQL Query: What's different?                │
│         │    Returns: ~100-200 changed records           │
│         └─────────────────────────────────────────────┐  │
└───────────────────────────────────────────────────────┼──┘
                                                         │
                  ┌──────────────┐                      │
                  │   Airflow    │                      │
                  │  (scheduler) │ ──every 15 min───────┘
                  └──────────────┘
                         │
                         │ triggers
                         ▼
                  ┌──────────────┐
                  │  Cloud Run   │
                  │     Job      │
                  │  (~1k lines) │
                  └──────────────┘
                         │
                         │ batch upsert
                         ▼
                  ┌──────────────┐
                  │   HubSpot    │
                  │     API      │
                  └──────────────┘
                         
How It Works:
─────────────
1. Cloud Composer (managed Airflow) triggers the sync every 15 minutes (see the scheduling sketch after this list)
2. Cloud Run queries Snowflake for delta (records that changed)
3. Snowflake compares our warehouse vs HubSpot data share
4. Returns only the differences (~100-200 records typically)
5. Cloud Run submits batch update to HubSpot API
6. Results logged for monitoring
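
To make step 1 concrete: scheduling a Cloud Run job from Cloud Composer can be a very small DAG. Here's a minimal sketch using the Google provider's CloudRunExecuteJobOperator; the project, region, and job names are placeholders, not our actual values.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_run import CloudRunExecuteJobOperator

with DAG(
    dag_id="hubspot_reverse_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="*/15 * * * *",  # every 15 minutes
    catchup=False,
) as dag:
    # Kick off the Cloud Run job that runs the delta query and pushes changes to HubSpot
    sync_contacts = CloudRunExecuteJobOperator(
        task_id="sync_contacts",
        project_id="my-gcp-project",      # placeholder
        region="us-central1",             # placeholder
        job_name="hubspot-contact-sync",  # placeholder
    )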

The Delta Query

This is where the Snowflake data share shines. Instead of scanning all 1M+ records in application code, we let Snowflake do the comparison:

-- Find contacts that need updating
WITH our_data AS (
  SELECT 
    user_id,
    email,
    phone,
    properties,
    updated_at
  FROM warehouse.contacts
  WHERE is_active = true
    -- A 20-minute lookback overlaps the 15-minute schedule so records aren't missed between runs
    AND updated_at >= CURRENT_TIMESTAMP - INTERVAL '20 minutes'
),
hubspot_data AS (
  SELECT 
    contact_id,
    email,
    phone,
    properties
  FROM hubspot_share.contacts
)
SELECT o.*
FROM our_data o
LEFT JOIN hubspot_data h ON o.email = h.email
WHERE 
  -- New contacts (not in HubSpot)
  h.contact_id IS NULL
  -- Or existing contacts with changes (NULL-safe, so a value appearing or disappearing counts as a change)
  OR o.phone IS DISTINCT FROM h.phone
  OR o.properties IS DISTINCT FROM h.properties
ORDER BY o.updated_at DESC

This query runs entirely in Snowflake and returns only the records that need updating. On average, that's 100-200 contacts per 15-minute window (out of 1M+ total).

The Cloud Run job just fetches these results and submits them to HubSpot's batch API. Simple, fast, cheap.
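
For reference, pulling those results into Python takes only a few lines with the snowflake-connector-python package the job already imports. A rough sketch of what an execute_snowflake_query helper could look like (connection parameters are placeholders, and credentials would come from a secret manager in practice):

import snowflake.connector
from snowflake.connector import DictCursor

def execute_snowflake_query(query: str):
    """Run a query against Snowflake and return rows as a list of dicts."""
    conn = snowflake.connector.connect(
        account="my_account",    # placeholder
        user="sync_service",     # placeholder
        password="***",          # pulled from a secret manager in practice
        warehouse="SYNC_WH",     # placeholder
        database="ANALYTICS",    # placeholder
    )
    try:
        cur = conn.cursor(DictCursor)
        cur.execute(query)
        return cur.fetchall()
    finally:
        conn.close()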

The Cloud Run Job

The full Cloud Run job is roughly 1k lines; the contact sync logic itself is ~200 lines of Python. The core function:

import snowflake.connector
import requests

def sync_contacts():
    """Sync changed contacts from Snowflake to HubSpot"""
    
    # 1. Query Snowflake for delta
    contacts = execute_snowflake_query(DELTA_QUERY)
    
    # 2. Transform for HubSpot API
    hubspot_batch = [
        {
            "email": c["email"],
            "properties": {
                "phone": normalize_phone(c["phone"]),
                "user_id": c["user_id"],
                # ... more fields
            }
        }
        for c in contacts
    ]
    
    # 3. Batch upsert to HubSpot
    response = requests.post(
        "https://api.hubspot.com/contacts/v1/contact/batch",
        json=hubspot_batch,
        headers={"Authorization": f"Bearer {HUBSPOT_API_KEY}"}
    )
    
    # 4. Log results for monitoring
    log_sync_results(response)
    
    return {"contacts_synced": len(contacts)}

That's it. The complexity is in the SQL query (which Snowflake handles), not the application code.
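
One practical detail: HubSpot's batch endpoints cap how many records you can send per request (the CRM v3 batch APIs accept up to 100 inputs, for example), so a production job splits the payload into chunks. A minimal sketch of that pattern, using a hypothetical submit_batch function for the actual POST:

def chunked(items, size=100):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def submit_in_batches(hubspot_batch, submit_batch, size=100):
    """Split the payload and submit one chunk at a time.

    submit_batch is whatever function actually POSTs a single chunk to
    HubSpot (hypothetical here). With the typical 100-200 changed records
    per run, this is usually just one or two requests.
    """
    return [submit_batch(chunk) for chunk in chunked(hubspot_batch, size)]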

Results

  • 96% reduction in data latency
  • 90% cost savings (~$50/month to run vs. $450/month for a vendor)
  • Built in 3 days

Infrastructure cost breakdown: Cloud Run compute is minimal (~$1/month, scales to zero between syncs). The bulk of the cost is Snowflake compute for delta queries (~$20-30/month) and Cloud Composer for Airflow scheduling (~$20-30/month). Total: ~$40-60/month depending on query volume. We use ~$50 as a typical estimate.

Business impact: Sales and marketing teams can now act on lead data within 15 minutes instead of 24+ hours. Response time is one of the strongest predictors of conversion.

Unexpected Benefits

Beyond speed and cost, building custom unlocked capabilities we hadn't anticipated:

1. Git-Based Version Control

All sync logic lives in version control. We can:

  • Review every change to a sync via pull request
  • Roll back a bad change with a simple revert
  • See the full history of how each sync has evolved

2. Easy to Add New Syncs

Want to sync a new dataset? Just add a SQL file and a config block. We went from 1 sync to 5 syncs in a month without touching the core pipeline code.

Each sync is just:

  • A SQL file that defines the delta query
  • A config block (source query, target HubSpot object, field mappings)

No vendor approval needed. No new connectors to purchase. Just commit and deploy.
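
To make that concrete, here's a hedged sketch of what one sync's config block could look like. The structure and field names are illustrative rather than the exact shape of our config:

# Hypothetical config block for one sync; the real structure may differ.
CONTACT_SYNC = {
    "name": "contacts",
    "delta_sql": "sql/contacts_delta.sql",  # the delta query shown earlier
    "hubspot_object": "contacts",
    "id_property": "email",                 # HubSpot keys contacts on email
    "field_mapping": {
        "phone": "phone",
        "user_id": "user_id",
    },
    "schedule": "*/15 * * * *",             # every 15 minutes
}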

3. Custom Logging and Alerting

We built exactly the observability we needed:

  • Per-run logs of how many records were queried, synced, and rejected
  • Alerts when a run fails or the delta size looks abnormal

With vendor solutions, you get their logging. With custom, you get exactly what you need.
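
As an illustration (not our exact implementation), a log_sync_results-style helper can be little more than structured logging plus a simple alert condition:

import json
import logging

logger = logging.getLogger("hubspot_sync")

def log_sync_results(response, contacts_synced: int, alert_threshold: int = 1000):
    """Log one sync run and flag anything that looks wrong.

    Illustrative sketch: the threshold and logged fields are placeholders.
    """
    record = {
        "status_code": response.status_code,
        "contacts_synced": contacts_synced,
    }
    logger.info("sync_run %s", json.dumps(record))

    # A delta far larger than the usual 100-200 records often means an upstream problem
    if not response.ok or contacts_synced > alert_threshold:
        logger.error("sync_run_alert %s", json.dumps(record))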

When to Build vs Buy

Not every company should build its own reverse-ETL pipeline. Here's when it makes sense:

Build when:

  • You have Snowflake data share (or similar) access to your destination system
  • Native integrations don't fit your data model
  • You need custom transformation logic
  • You're syncing high volumes (vendor costs scale with rows)
  • You have 2-3 days of engineering time available

Buy when:

  • You don't have data share access (would need to build ETL first)
  • Native integration works fine for your use case
  • Engineering time is scarce
  • You're syncing to multiple destinations (vendor handles this)

The ROI Calculation

For us, the math was clear:

  • Hightouch: $450/month ($5,400/year)
  • Custom pipeline: ~$50/month to run, plus roughly 3 days of engineering time to build
  • Net savings: ~$400/month, so the build paid for itself in about 6 months

Plus the intangible benefits: full control, git-based versioning, custom observability, and the ability to add new syncs without vendor costs.

Key Takeaways

Three lessons from building this:

1. Data Shares Change the Build vs Buy Equation

Without Snowflake data share, this would've taken weeks (build ETL, then build sync logic). With it, we built in days. Data shares make custom integrations feasible for smaller teams.

2. Keep the Cloud Run Job Simple

All the complexity lives in SQL (which Snowflake is great at). The application code just orchestrates API calls. This makes it easy to test, debug, and maintain.
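
For example, if the transform step is pulled out into its own small function (a hypothetical build_hubspot_batch helper here), it can be unit-tested without touching Snowflake or HubSpot at all:

def build_hubspot_batch(contacts):
    """Hypothetical helper: turn warehouse rows into HubSpot batch payloads."""
    return [
        {
            "email": c["email"],
            "properties": {"phone": c["phone"], "user_id": c["user_id"]},
        }
        for c in contacts
    ]

def test_build_hubspot_batch():
    rows = [{"email": "a@example.com", "phone": "555-0100", "user_id": "uuid-1"}]
    batch = build_hubspot_batch(rows)
    assert batch[0]["email"] == "a@example.com"
    assert batch[0]["properties"]["user_id"] == "uuid-1"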

3. Start with One Sync, Then Scale

We built the contact sync first. Once that worked, adding new syncs (deals, companies, custom objects) was trivial. Start small, validate the architecture, then expand.

Conclusion

When HubSpot's native sync failed, we had a choice: pay $450/month for a vendor solution or invest 3 days building our own. We chose to build, and the pipeline paid for itself within 6 months.

The key enabler was Snowflake's data share. It turned what could've been a complex multi-week project into a few days of focused work.

If you're facing a similar decision, consider whether you have data share access. If you do, building custom might be faster and cheaper than you think.

Need help with your data infrastructure?

I help B2B SaaS companies optimize their data stacks through FinOps audits and hands-on implementation work.

Whether you're evaluating build vs buy decisions, migrating between systems, or optimizing existing pipelines, let's talk.