Migrating to a data lake: A practical blueprint

The architectural decisions and migration process you need to make a data lake the center of your architecture.
May 9, 2025

Your team may consider migrating to a data lake for one or both of the following reasons: 

  1. Centralization of large volumes of structured and unstructured data
  2. Enabling more advanced analytics such as AI

Successfully migrating to a data lake involves several key considerations:

  • Architectural decisions shape how data is stored, accessed, and managed. An architecture should ensure scalability, cost-effectiveness, and compatibility with your organization's data needs. 
  • Tools like Fivetran can streamline migration, automating data movement from various sources into the data lake with minimal effort. 
  • Different query engines work well for different use cases and data stacks. Once the data lands in the data lake, the right query engine can enable analytics and insights without disrupting existing workflows.

Key architectural decisions for your data lake

When building or migrating to a data lake, four foundational decisions shape your architecture:

1. Cloud storage

This is often the most critical choice, as it's typically the hardest to change later. You'll need to choose among Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. In many cases, the decision is simplified by your existing cloud footprint: if your organization already relies heavily on AWS, for example, S3 is a natural fit. Cloud object storage is largely commoditized, but the remaining differences in performance, pricing, and features across providers are still worth weighing.

2. Table format

Your choice of open table format determines how your data supports features like transactional consistency, ACID compliance, and schema evolution. The Fivetran Managed Data Lake Service supports landing data in both Iceberg and Delta Lake formats, across all major storage providers. This dual-format strategy gives you the ability to query your data using a wide range of engines without being locked into a specific format.

3. Catalog

A catalog maintains metadata about your datasets and is essential for discoverability and consistency. Options include AWS Glue, Unity Catalog, BigQuery Metastore, or Fivetran’s own Iceberg REST Catalog (docs).

Fivetran automatically provisions a dedicated Iceberg REST Catalog. You can also configure third-party catalogs depending on your storage and table format (supported options). See the architectural diagrams below for examples.

4. Query engine

Choosing a query engine depends on your team's familiarity and existing tools. Some engines are bundled with cloud data warehouses and can also query open table formats in a data lake natively. Examples include Snowflake, Databricks, BigQuery, Amazon Athena, Trino, and Apache Spark.
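To make this concrete, here is a minimal, illustrative sketch of querying an Iceberg table in object storage with Apache Spark, one of the engines listed above. The catalog URI, namespace, and table names are placeholders, authentication settings for the catalog are omitted, and the iceberg-spark-runtime package is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Illustrative sketch: register an Iceberg REST catalog with Spark and query a
# table directly in the lake. All names and URIs below are placeholders.
spark = (
    SparkSession.builder.appName("lake-query-example")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://example.com/iceberg/rest")  # placeholder endpoint
    .getOrCreate()
)

# Standard SQL against the open table format; no proprietary load step required.
spark.sql(
    "SELECT order_date, SUM(amount) AS revenue "
    "FROM lake.sales.orders "
    "GROUP BY order_date"
).show()
```

Because the table format and catalog are open, the same table could be queried from Trino, Snowflake, or Databricks without copying the data.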

Use interoperability as a guiding principle

These choices are interdependent. For instance, if you plan to query with Databricks, selecting Delta Lake and Unity Catalog makes sense due to deep integration (cf. Databricks Delta Table docs).

Start with the decisions you’re most confident in, such as your storage provider. The inherent interoperability of data lakes means that you should have considerable flexibility to tailor the rest of your architecture to the specific needs of your team and use cases.

Fivetran's interoperable approach

Fivetran's data lake architecture is designed for interoperability. We support both the Iceberg and Delta Lake formats and integrate with third-party catalogs such as AWS Glue, as well as Fivetran's own Iceberg REST Catalog (built on Apache Polaris under the hood).

One major benefit of a data lake architecture is flexibility—you’re not locked into one query engine. Instead, your teams can access the same data from different engines, depending on the use case.

Migration and implementation

Once you have set up your data lake, you will need to populate it with your data. Getting started with the Fivetran Managed Data Lake Service is straightforward. We provide a detailed setup guide, including storage-specific instructions for each supported cloud provider. Fivetran can perform historical syncs from SaaS and database sources and can also sync directly from data warehouses.

If you're an existing Fivetran customer currently landing data in a different destination (e.g., Snowflake or BigQuery), contact your Account Manager. Our team can help you migrate your existing connectors to your new data lake destination with minimal disruption.

Querying options for your existing data stack

Once your data lands in the lake, the next step is making it accessible through the tools your teams already use. Fivetran’s architecture is built to support flexible, scalable query integration patterns—allowing you to analyze your data where it makes the most sense.

Fivetran follows an Extract, Load, Transform (ELT) approach: raw data is first loaded into your data lake and then transformed downstream as needed. This separation provides greater flexibility for querying and modeling the data post-ingestion.

There are many ways to query data in a lake environment. The following two subsections showcase two of the most common approaches.

Snowflake and Iceberg REST Catalog 

In our Fivetran Iceberg REST Catalog Integration Guide, we walk through how to configure Snowflake to query data stored in your lake via the Iceberg REST Catalog. This pattern lets you land data in a storage layer (such as S3 or ADLS) in the Iceberg table format. Once the external tables are initialized and registered in Snowflake, they can be refreshed automatically to reflect the latest data in the storage layer, eliminating the need for manual refreshes or duplicated ETL pipelines.
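As a rough sketch of what this involves (not a substitute for the integration guide), the Snowflake side boils down to creating a catalog integration that points at the Iceberg REST Catalog and then registering externally managed Iceberg tables against it. The example below runs that DDL through the snowflake-connector-python package; the endpoint, credentials, and table names are placeholders, and the exact parameters for your account are covered in the guide.

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hedged sketch only: the statements follow Snowflake's Iceberg REST catalog
# integration DDL, but all identifiers, URLs, and secrets are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...", role="SYSADMIN"
)
cur = conn.cursor()

# 1. Point Snowflake at the Iceberg REST Catalog.
cur.execute("""
    CREATE CATALOG INTEGRATION IF NOT EXISTS fivetran_rest_catalog
      CATALOG_SOURCE = ICEBERG_REST
      TABLE_FORMAT = ICEBERG
      REST_CONFIG = (CATALOG_URI = 'https://<catalog-endpoint>/api/catalog', WAREHOUSE = '<catalog_name>')
      REST_AUTHENTICATION = (TYPE = OAUTH, OAUTH_CLIENT_ID = '<id>', OAUTH_CLIENT_SECRET = '<secret>',
                             OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL'))
      ENABLED = TRUE
""")

# 2. Register an externally managed Iceberg table that refreshes automatically
#    as new snapshots are committed to the catalog.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS analytics.facebook_ads.account_history
      CATALOG = 'fivetran_rest_catalog'
      CATALOG_TABLE_NAME = 'account_history'
      AUTO_REFRESH = TRUE
""")
```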

Databricks and Unity Catalog

Another common and powerful querying pattern for data lakes involves integrating with Unity Catalog and Databricks. In our blog post, A modern data lake with Fivetran Managed Data Lake Service and Databricks Unity Catalog, we detail how Fivetran’s Managed Data Lake Service works seamlessly with Unity Catalog to support this architecture.

When using Fivetran’s native integration, schemas and tables in your lake are automatically kept up to date—streamlining governance and access control. To support a centralized architecture, your Unity Catalog metastore connects to your storage location and acts as a unified metadata layer. You can then attach multiple Databricks workspaces to the same metastore, enabling consistent access across environments without the need to configure multiple Fivetran destinations or maintain separate development setups.

This integration offers a scalable, well-governed path to querying lake data directly from Databricks, with minimal operational overhead.
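For a sense of what the consumer side looks like, here is an illustrative snippet that reads a lake table through Unity Catalog's catalog.schema.table namespace using the databricks-sql-connector package. The hostname, HTTP path, token, and table name are all placeholders.

```python
from databricks import sql  # pip install databricks-sql-connector

# Illustrative only: once the metastore is attached to a workspace, lake tables
# are addressed by Unity Catalog's three-level namespace. Placeholders throughout.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # placeholder workspace
    http_path="/sql/1.0/warehouses/abc123",                        # placeholder SQL warehouse
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM lake_catalog.my_schema.my_table LIMIT 10")
        for row in cur.fetchall():
            print(row)
```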

Fivetran data models on the data lake

Fivetran offers dbt Core-compatible data models (formerly known as “Fivetran dbt packages”) for our most popular connectors. These pre-built models produce clean, analytics-ready tables that feed directly into your reports, dashboards, and BI tools.

You can use our dbt Core-compatible data models in a data lake architecture with a supported query engine such as BigQuery, Databricks, PostgreSQL, Redshift, or Snowflake by following these steps:

  • Ingest your source data
    Start by setting up the Fivetran connectors for the sources you want to work with, and ensure the necessary external tables are created in your query engine. The specific tables required for each data model are listed in the corresponding documentation. For example, if you're planning to use the facebook_ads__account_report model, the lineage graph shows that it depends on the basic_ad_action_items, basic_ad_actions, account_history, and basic_ad tables. These source tables must be present as external tables in your query engine for the model to run successfully.
  • Create your dbt project
    Once your external tables are available, create a dbt Core project targeting your query engine. Follow our detailed Transformations Setup Guide for step-by-step instructions.
  • Install the Fivetran data models
    In Step 6 of the dbt setup guide, you’ll install the Fivetran data model for your specific source (e.g., Facebook Ads Data Model) in your dbt project.
    When configuring the source model, it’s important to correctly define the database and schema variables that point to where the external tables were created in your data lake environment. Here’s an example for Facebook Ads.
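
For illustration, assuming your external tables landed in a database called analytics under a facebook_ads schema, the dbt_project.yml variables might look like the excerpt below. The variable names follow the Fivetran facebook_ads package; the database and schema values are placeholders for your environment.

```yaml
# dbt_project.yml (excerpt) -- values are placeholders for your environment
vars:
  facebook_ads_database: analytics    # database/catalog where Fivetran landed the source tables
  facebook_ads_schema: facebook_ads   # schema containing basic_ad, account_history, etc.
```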

Data lakes made simple

With the automation, flexibility, and interoperability offered by modern tools and technologies, migrating to a data lake has never been easier. Instead of spending engineering time on low-level implementation, your team can focus on making the architectural choices that best fit its use cases and needs.

Fivetran can help you get most of the way there. Like all Fivetran offerings, the Managed Data Lake Service is designed for simplicity and ease of use, automating data integration and management in the data lake.

[CTA_MODULE]
