diff --git a/src/connections/storage/catalog/data-lakes/index.md b/src/connections/storage/catalog/data-lakes/index.md
index 9d96da8d11..d05220b839 100644
--- a/src/connections/storage/catalog/data-lakes/index.md
+++ b/src/connections/storage/catalog/data-lakes/index.md
@@ -7,9 +7,7 @@ redirect_from:
 {% include content/plan-grid.md name="data-lakes" %}
 
 Segment Data Lakes provide a way to collect large quantities of data in a format that's optimized for targeted data science and data analytics workflows. You can read [more information about Data Lakes](/docs/connections/storage/data-lakes/) and learn [how they differ from Warehouses](/docs/connections/storage/data-lakes/comparison/) in Segment's Data Lakes documentation.
 
-Segment supports two type of data-lakes:
-- [AWS Data Lakes](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes)
-- [Segment Data Lakes (Azure)](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes-azure)
+Segment only supports [AWS Data Lakes](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes).
 
 > success ""
 > You can also set up your Segment Data Lakes using [Lake Formation](/docs/connections/storage/data-lakes/lake-formation/), a fully managed service built on top of the AWS Glue Data Catalog.
@@ -83,302 +81,6 @@ Replay processing time can vary depending on the volume of data and number of ev
 Segment creates a separate EMR cluster to run replays, then destroys it when the replay finishes. This ensures that regular Data Lakes syncs are not interrupted, and helps the replay finish faster.
 
-## Set up Segment Data Lakes (Azure)
-
-To set up Segment Data Lakes (Azure), create your Azure resources and then enable the Data Lakes destination in the Segment app.
-
-### Prerequisites
-
-Before you can configure your Azure resources, you must complete the following prerequisites:
-- [Create an Azure subscription](https://azure.microsoft.com/en-us/free/){:target="_blank”}
-- [Create an Azure resource group](https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/manage-resource-groups-portal#create-resource-groups){:target="_blank”}
-- Create an account with `Microsoft.Authorization/roleAssignments/write` permissions
-- Configure the [Azure Command Line Interface (Azure CLI)](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli){:target="_blank”}
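If you plan to drive the rest of the setup from the command line, the subscription selection and resource group from this prerequisite list can be handled up front. This is a minimal sketch only; the subscription ID, resource group name, and region are placeholders:

```bash
# Sketch only — subscription ID, resource group name, and region are placeholders.
az login
az account set --subscription <subscription-id>
az group create --name segment-data-lakes-rg --location eastus
```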
-### Step 1 - Create an ALDS-enabled storage account
-
-1. Sign in to your [Azure environment](https://portal.azure.com){:target="_blank”}.
-2. From the [Azure home page](https://portal.azure.com/#home){:target="_blank”}, select **Create a resource**.
-3. Search for and select **Storage account**.
-4. On the Storage account resource page, select the **Storage account** plan and click **Create**.
-5. On the **Basic** tab, select an existing subscription and resource group, give your storage account a name, and update any necessary instance details.
-6. Click **Next: Advanced**.
-7. On the **Advanced Settings** tab in the Security section, select the following options:
-  - Require secure transfer for REST API operations
-  - Enable storage account key access
-  - Minimum TLS version: Version 1.2
-8. In the Data Lake Storage Gen2 section, select **Enable hierarchical namespace**. In the Blob storage selection, select the **Hot** option.
-9. Click **Next: Networking**.
-10. On the **Networking** page, select **Disable public access and use private access**.
-11. Click **Review + create**. Take note of your location and storage account name, and review your chosen settings. When you are satisfied with your selections, click **Create**.
-12. After your resource is deployed, click **Go to resource**.
-13. On the storage account overview page, select the **Containers** button in the Data storage tab.
-14. Select **Container**. Give your container a name, and select the **Private** level of public access. Click **Create**.
-
-> warning " "
-> Before continuing, note the Location, Storage account name, and the Azure storage container name: you'll need this information when configuring the Segment Data Lakes (Azure) destination in the Segment app.
-
-### Step 2 - Set up Key Vault
-
-1. From the [home page of your Azure portal](https://portal.azure.com/#home){:target="_blank”}, select **Create a resource**.
-2. Search for and select **Key Vault**.
-3. On the Key Vault resource page, select the **Key Vault** plan and click **Create**.
-4. On the **Basic** tab, select an existing subscription and resource group, give your Key Vault a name, and update the **Days to retain deleted vaults** setting, if desired.
-5. Click **Review + create**.
-6. Review your chosen settings. When you are satisfied with your selections, click **Review + create**.
-7. After your resource is deployed, click **Go to resource**.
-8. On the Key Vault page, select the **Access control (IAM)** tab.
-9. Click **Add** and select **Add role assignment**.
-10. On the **Roles** tab, select the `Key Vault Secrets User` role. Click **Next**.
-11. On the **Members** tab, select a **User, group, or service principal**.
-12. Click **Select members**.
-13. Search for and select the `Databricks Resource Provider` service principal.
-14. Click **Select**.
-15. Under the **Members** header, verify that you selected the Databricks Resource Provider. Click **Review + assign**.
-
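The portal flows in Steps 1 and 2 can also be scripted. The sketch below covers only the core resources — an ADLS Gen2 (hierarchical namespace) storage account, a private container, and a Key Vault — using placeholder names and region; settings such as the networking options and the `Key Vault Secrets User` role assignment for the Databricks Resource Provider still need to be applied as described above.

```bash
# Sketch only — names, resource group, and region are placeholders.
az storage account create \
  --name segmentdatalake \
  --resource-group segment-data-lakes-rg \
  --location eastus \
  --kind StorageV2 \
  --sku Standard_LRS \
  --enable-hierarchical-namespace true \
  --min-tls-version TLS1_2

az storage container create \
  --name segment-data \
  --account-name segmentdatalake \
  --public-access off

az keyvault create \
  --name segment-dl-vault \
  --resource-group segment-data-lakes-rg \
  --location eastus
```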
-### Step 3 - Set up Azure MySQL database
-
-1. From the [home page of your Azure portal](https://portal.azure.com/#home){:target="_blank”}, select **Create a resource**.
-2. Search for and select **Azure Database for MySQL**.
-3. On the Azure Database for MySQL resource page, select the **Azure Database for MySQL** plan and click **Create**.
-4. Select **Single server** and click **Create**.
-5. On the **Basic** tab, select an existing subscription and resource group, enter server details and create an administrator account. Due to the configurations required for the setup, Data Lakes supports MySQL version **5.7** only. Before you proceed, please ensure you have the correct MySQL server version selected.
-6. Click **Review + create**.
-7. Review your chosen settings. When you are satisfied with your selections, click **Create**.
-8. After your resource is deployed, click **Go to resource**.
-9. From the resource page, select the **Connection security** tab.
-10. Under the Firewall rules section, select **Yes** to allow access to Azure services, and click the **Allow current client IP address (xx.xxx.xxx.xx)** button to allow access from your current IP address.
-11. Click **Save** to save the changes you made on the **Connection security** page, and select the **Server parameters** tab.
-12. Update the `lower_case_table_names` value to 2, and click **Save**.
-13. Select the **Overview** tab and click the **Restart** button to restart your database. Restarting your database updates the `lower_case_table_name` setting.
-14. Once the server restarts successfully, open your Azure CLI.
-15. Sign into the MySQL server from your command line by entering the following command:
-```curl
-mysql --host=/[HOSTNAME] --port=3306 --user=[USERNAME] --password=[PASSWORD]
-```
-16. Run the `CREATE DATABASE` command to create your Hive Metastore:
-```sql
-CREATE DATABASE ;
-```
-
-> warning " "
-> Before continuing, note the MySQL server URL, username and password for the admin account, and your database name: you'll need this information when configuring the Segment Data Lakes (Azure) destination in the Segment app.
-
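If you'd rather script Step 3, the same server can be created with the Azure CLI's single-server MySQL commands. This is a sketch only: the names, region, credentials, and SKU are placeholders, the exact `az mysql` syntax depends on your CLI version, and the firewall rules from steps 9-11 still need to be configured.

```bash
# Sketch only — names, region, credentials, and SKU are placeholders.
az mysql server create \
  --resource-group segment-data-lakes-rg \
  --name segment-hive-metastore \
  --location eastus \
  --admin-user metastore_admin \
  --admin-password '<strong-password>' \
  --sku-name GP_Gen5_2 \
  --version 5.7

# Data Lakes requires lower_case_table_names=2; the new value only takes effect after a restart.
az mysql server configuration set \
  --resource-group segment-data-lakes-rg \
  --server-name segment-hive-metastore \
  --name lower_case_table_names \
  --value 2

az mysql server restart \
  --resource-group segment-data-lakes-rg \
  --name segment-hive-metastore
```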
-### Step 4 - Set up Databricks
-
-> info "Databricks pricing tier"
-> If you create a Databricks instance only for Segment Data Lakes (Azure) usage, only the standard pricing tier is required. However, if you use your Databricks instance for other applications, you may require premium pricing.
-
-1. From the [home page of your Azure portal](https://portal.azure.com/#home){:target="_blank”}, select **Create a resource**.
-2. Search for and select **Azure Databricks**.
-3. On the Azure Database for MySQL resource page, select the **Azure Databricks** plan and click **Create**.
-4. On the **Basic** tab, select an existing subscription and resource group, enter a name for your workspace, select the region you'd like to house your Databricks instance in, and select a pricing tier. For those using the Databricks instance only for Segment Data Lakes (Azure), a Standard pricing tier is appropriate. If you plan to use your Databricks instance for more than just Segment Data Lakes (Azure), you may require the premium pricing tier.
-5. Click **Review + create**.
-6. Review your chosen settings. When you are satisfied with your selections, click **Create**.
-7. After your resource is deployed, click **Go to resource**.
-8. On the Azure Databricks Service overview page, click **Launch Workspace**.
-9. On the Databricks page, select **Create a cluster**.
-10. On the Compute page, select **Create Cluster**.
-11. Enter a name for your cluster and select the `Standard_DS4_v2` worker type. Set the minimum number of workers to 2, and the maximum number of workers to 8. __Segment recommends deselecting the "Terminate after X minutes" setting, as the time it takes to restart a cluster may delay your Data Lake syncs.__
-12. Click **Create Cluster**.
-13. Open [your Azure portal](https://portal.azure.com/#home){:target="_blank”} and select the Key Vault you created in a previous step.
-14. On the Key Vault page, select the JSON View link to view the Resource ID and vaultURI. Take note of these values, as you'll need them in the next step to configure your Databricks instance.
-15. Open `https://#secrets/createScope` and enter the following information to connect your Databricks instance to the Key Vault you created in an earlier step:
-  - **Scope Name**: Set this value to `segment`.
-  - **Manage Principal**: Select **All Users**.
-  - **DNS Name**: Set this value to the Vault URI of your Key Vault instance.
-  - **Resource ID**: The Resource ID of your Azure Key Vault instance.
-16. When you've entered all of your information, click **Create**.
-
-> warning " "
-> Before continuing, note the Cluster ID, Workspace name, Workspace URL, and the Azure Resource Group for your Databricks Workspace: you'll need this information when configuring the Segment Data Lakes (Azure) destination in the Segment app.
-
-### Step 5 - Set up a Service Principal
-
-1. Open the Databricks instance you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
-2. Click **Settings** and select **User settings**.
-3. On the Access tokens page, click **Generate new token**.
-4. Enter a comment for your token, select the lifetime of your ticket, and click **Generate**.
-5. Copy your token, as you'll use this to add your service principal to your workspace.
-6. Open your Azure CLI and create a new service principal using the following commands:
-``` powershell
-az login
-az ad sp create-for-rbac --name 
-```
-7. In your Azure portal, select the Databricks instance you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
-8. On the overview page for your Databricks instance, select **Access control (IAM)**.
-9. Click **Add** and select **Add role assignment**.
-10. On the **Roles** tab, select the `Managed Application Operator` role. Click **Next**.
-11. On the **Members** tab, select a **User, group, or service principal**.
-12. Click **Select members**.
-13. Search for and select the Service Principal you created above.
-14. Click **Select**.
-15. Under the **Members** header, verify that you selected your Service Principal. Click **Review + assign**.
-16. Return to the Azure home page. Select your storage account.
-17. On the overview page for your storage account, select **Access control (IAM)**.
-18. Click **Add** and select **Add role assignment**.
-19. On the **Roles** tab, select the `Storage Blob Data Contributor` role. Click **Next**.
-20. On the **Members** tab, select a **User, group, or service principal**.
-21. Click **Select members**.
-22. Search for and select the Service Principal you created above.
-23. Click **Select**.
-24. Under the **Members** header, verify that you selected your Service Principal. Click **Review + assign**.
-25. Open your Key Vault. In the sidebar, select **Secrets**.
-26. Click **Generate/Import**.
-27. On the Create a secret page, select **Manual**. Enter the name `spsecret` for your secret, and enter the name of the secret you created in Databricks in the **Value** field.
-28. From your Azure CLI, call the Databricks SCIM API to add your service principal to your workspace, replacing ` `with the URL of your Databricks workspace, ` `with the access token you created in an earlier step, and `` with the client ID of your service principal:
-```curl
-curl -X POST 'https:///api/2.0/preview/scim/v2/ServicePrincipals' \
-  --header 'Content-Type: application/scim+json' \
-  --header 'Authorization: Bearer ' \
-  --data-raw '{
-    "schemas":[
-      "urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"
-    ],
-    "applicationId":"",
-    "displayName": "test-sp",
-    "entitlements":[
-      {
-        "value":"allow-cluster-create"
-      }
-    ]
-  }'
-```
-29. Open Databricks and navigate to your cluster. Select **Permissions**.
-30. In the permissions menu, grant your service principal **Can Manage** permissions.
-
-> warning " "
-> Before continuing, note the Client ID and Client Secret for your Service Principal: you'll need this information when configuring the Segment Data Lakes (Azure) destination in the Segment app.
-
-### Step 6 - Configure Databricks Cluster
-
-> warning "Optional configuration settings for log4j vulnerability"
-> While Databricks released a statement that clusters are likely unaffected by the log4j vulnerability, out of an abundance of caution, Databricks recommends updating to log4j 2.15+ or adding the following options to the Spark configuration:
-`spark.driver.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`
-`spark.executor.extraJavaOptions "-Dlog4j2.formatMsgNoLookups=true"`
-
-1. Connect to a [Hive metastore](https://docs.databricks.com/data/metastores/external-hive-metastore.html){:target="_blank”} on your Databricks cluster using the following Spark configuration, replacing the variables (``) with information from your workspace:
-```py
-## Configs so we can read from the storage account
-spark.hadoop.fs.azure.account.oauth.provider.type..dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
-spark.hadoop.fs.azure.account.oauth2.client.endpoint..dfs.core.windows.net https://login.microsoftonline.com//oauth2/token
-spark.hadoop.fs.azure.account.oauth2.client.secret..dfs.core.windows.net
-spark.hadoop.fs.azure.account.auth.type..dfs.core.windows.net OAuth
-spark.hadoop.fs.azure.account.oauth2.client.id..dfs.core.windows.net
-##
-##
-spark.hadoop.javax.jdo.option.ConnectionDriverName org.mariadb.jdbc.Driver
-spark.hadoop.javax.jdo.option.ConnectionURL jdbc:mysql://:/?useSSL=true&requireSSL=true&enabledSslProtocolSuites=TLSv1.2
-spark.hadoop.javax.jdo.option.ConnectionUserName
-spark.hadoop.javax.jdo.option.ConnectionPassword
-##
-##
-##
-spark.hive.mapred.supports.subdirectories true
-spark.sql.storeAssignmentPolicy Legacy
-mapreduce.input.fileinputformat.input.dir.recursive true
-spark.sql.hive.convertMetastoreParquet false
-##
-datanucleus.autoCreateSchema true
-datanucleus.autoCreateTables true
-spark.sql.hive.metastore.schema.verification false
-datanucleus.fixedDatastore false
-##
-spark.sql.hive.metastore.version 2.3.7
-spark.sql.hive.metastore.jars builtin
-```
-
-3. Log in to your Databricks instance and open your cluster.
-4. On the overview page for your cluster, select **Edit**.
-5. Open the **Advanced options** toggle and paste the Spark config you copied above, replacing the variables (``) with information from your workspace.
-6. Select **Confirm and restart**. On the popup window, select **Confirm**.
-7. Log in to your Azure MySQL database using the following command:
-```curl
-mysql --host=[HOSTNAME] --port=3306 --user=[USERNAME] --password=[PASSWORD]
-```
-8. Once you've logged in to your MySQL database, run the following commands:
-```sql
-USE 
-INSERT INTO VERSION (VER_ID, SCHEMA_VERSION) VALUES (0, '2.3.7');
-```
-9. Log in to your Databricks cluster.
-10. Click **Create** and select **Notebook**.
-11. Give your cluster a name, select **SQL** as the default language, and make sure it's located in the cluster you created in [Step 4 - Set up Databricks](#step-4---set-up-databricks).
-12. Click **Create**.
-13. On the overview page for your new notebook, run the following command:
-```sql
-CREATE TABLE test (id string);
-```
-14. Open your cluster.
-15. On the overview page for your cluster, select **Edit**.
-16. Open the **Advanced options** toggle and paste the following code snippet:
-```py
-datanucleus.autoCreateSchema false
-datanucleus.autoCreateTables false
-spark.sql.hive.metastore.schema.verification true
-datanucleus.fixedDatastore true
-```
-17. Select **Confirm and restart**. On the popup window, select **Confirm**.
-
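Before moving on to Step 7, it's worth confirming that the Hive Metastore schema row inserted in step 8 above is actually in place. This is a minimal sketch using the same MySQL connection details; the host, user, and database name are placeholders:

```bash
# Sketch only — host, user, and database name are placeholders.
mysql --host=[HOSTNAME] --port=3306 --user=[USERNAME] --password \
  --execute "SELECT VER_ID, SCHEMA_VERSION FROM [DATABASE_NAME].VERSION;"
```

If this returns the `2.3.7` row you inserted earlier, the metastore is initialized; if it errors, recheck the Spark configuration before proceeding.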
-### Step 7 - Enable the Data Lakes destination in the Segment app
-
-After you set up the necessary resources in Azure, the next step is to set up the Data Lakes destination in Segment:
-
-1. In the [Segment App](https://app.segment.com/goto-my-workspace/overview){:target="_blank”}, click **Add Destination**.
-2. Search for and select **Segment Data Lakes (Azure)**.
-2. Click the **Configure Data Lakes** button, and select the source you'd like to receive data from. Click **Next**.
-3. In the **Connection Settings** section, enter the following values:
-  - **Azure Storage Account**: The name of the Azure Storage account that you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
-  ![img.png](images/storageaccount.png)
-  - **Azure Storage Container**: The name of the Azure Storage Container you created in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
-  ![img_1.png](images/storagecontainer.png)
-  - **Azure Subscription ID**: The ID of your [Azure subscription](https://docs.microsoft.com/en-us/azure/azure-portal/get-subscription-tenant-id){:target="_blank”}.
-    Please add it as it is in the Azure portal, in the format `********-****-****-****-************`
-  - **Azure Tenant ID**: The Tenant ID of your [Azure Active directory](https://docs.microsoft.com/en-us/azure/active-directory/fundamentals/active-directory-how-to-find-tenant){:target="_blank”}.
-    Please add it as it is in the Azure portal, in the format `********-****-****-****-************`
-  - **Databricks Cluster ID**: The ID of your [Databricks cluster](https://docs.databricks.com/workspace/workspace-details.html#cluster-url-and-id){:target="_blank”}.
-  - **Databricks Instance URL**: The ID of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank”}.
-    The correct format for adding the URL is 'adb-0000000000000000.00.azureatabricks.net'
-  - **Databricks Workspace Name**: The name of your [Databricks workspace](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids){:target="_blank”}.
-  - **Databricks Workspace Resource Group**: The resource group that hosts your Azure Databricks instance. This is visible in Azure on the overview page for your Databricks instance.
-  - **Region**: The location of the Azure Storage account you set up in [Step 1 - Create an ALDS-enabled storage account](#step-1---create-an-alds-enabled-storage-account).
-  - **Service Principal Client ID**: The Client ID of the Service Principal that you set up in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
-  - **Service Principal Client Secret**: The Secret for the Service Principal that you set up in [Step 5 - Set up a Service Principal](#step-5---set-up-a-service-principal).
-
-
-### (Optional) Set up your Segment Data Lake (Azure) using Terraform
-
-Instead of manually configuring your Data Lake, you can create it using the script in the [`terraform-segment-data-lakes`](https://github.com/segmentio/terraform-segment-data-lakes){:target="_blank”} GitHub repository.
-
-> warning ""
-> This script requires Terraform versions 0.12+.
-
-Before you can run the Terraform script, create a Databricks workspace in the Azure UI using the instructions in [Step 4 - Set up Databricks](#step-4---set-up-databricks). Note the **Workspace URL**, as you will need it to run the script.
-
-In the setup file, set the following local variables:
-
-```js
-
-locals {
-region = ""
-resource_group = ""
-storage_account = "
-```py
-spark.sql.hive.metastore.schema.verification.record.version false
-```
-After you've added to your config, restart your cluster so that your changes can take effect. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
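Both of the troubleshooting paths in this FAQ come down to checking which Spark properties are actually applied to the cluster. If you have the legacy Databricks CLI installed and configured, one way to dump the running cluster spec, including its `spark_conf`, is a call like the sketch below; the cluster ID is a placeholder:

```bash
# Sketch only — assumes the legacy Databricks CLI is configured; the cluster ID is a placeholder.
databricks clusters get --cluster-id <cluster-id>
```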
-
-#### What do I do if I get a "Version table does not exist" error when setting up the Azure MySQL database?
-Check your Spark configs to ensure that the information you entered about the database is correct, then restart the cluster. The Databricks cluster automatically initializes the Hive Metastore, so an issue with your config file will stop the table from being created. If you continue to encounter errors, [contact Segment Support](https://segment.com/help/contact/){:target="_blank"}.
diff --git a/src/connections/storage/data-lakes/index.md b/src/connections/storage/data-lakes/index.md
index d2a823fb09..620d6d71e8 100644
--- a/src/connections/storage/data-lakes/index.md
+++ b/src/connections/storage/data-lakes/index.md
@@ -19,16 +19,16 @@ Data lakes typically have four layers:
 
 ![A graphic showing the information flowing from the metadata into the query, compute, and metadata layers, and then into the storage layer](images/data_lakes_overview_graphic.png)
 
-Segment Data Lakes sends Segment data to a cloud data store, either AWS S3 or Azure Data Lake Storage Gen2 (ADLS), in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
+Segment Data Lakes sends Segment data to a cloud data store (AWS S3) in a format optimized to reduce processing for data analytics and data science workloads. Segment data is great for building machine learning models for personalization and recommendations, and for other large scale advanced analytics. Data Lakes reduces the amount of processing required to get real value out of your data.
 
 > warning "Segment Data Lakes deletion policies"
-> Segment Data Lakes (AWS) and Segment Data Lakes (Azure) do not support Segment's [user deletion and suppression](/docs/privacy/user-deletion-and-suppression/) capabilities, as you retain your data in systems that you manage.
+> Segment Data Lakes (AWS) does not support Segment's [user deletion and suppression](/docs/privacy/user-deletion-and-suppression/) capabilities, as you retain your data in systems that you manage.
 
 To learn more about Segment Data Lakes, check out the Segment blog post [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.
 
 ## How Data Lakes work
 
-Segment supports Data Lakes hosted on two cloud providers: Amazon Web Services (AWS) and Microsoft Azure. Each cloud provider has a similar system for managing data, but offer different query engines, post-processing systems, and analytics options.
+Segment supports Data Lakes hosted on Amazon Web Services (AWS).
 
 ### How Segment Data Lakes (AWS) works
 
@@ -40,16 +40,6 @@ Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapR
 
 ![A diagram visualizing data flowing from a Segment user into your account and into a Glue catalog/S3 bucket](images/dl_vpc.png)
 
-### How Segment Data Lakes (Azure) works
-
-Data Lakes store Segment data in ADLS in a read-optimized encoding format (Parquet) which makes the data more accessible and actionable. To help you zero-in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, like the Hive Metastore. The resulting data set is optimized for use with systems like Power BI and Azure HDInsight or machine learning vendors like Azure Databricks or Azure Synapse Analytics.
-
-![A diagram showing data flowing from Segment, through DataBricks, Parquet and Azure Data Lake Storage Gen2 into the Hive Metastore, and then into your post-processing systems](images/Azure_DL_setup.png)
-
-## Set up Segment Data Lakes (Azure)
-
-For detailed Segment Data Lakes (Azure) setup instructions, see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/).
-
 ### Set up Segment Data Lakes (AWS)
 
 When setting up your data lake using the [Data Lakes catalog page](/docs/connections/storage/catalog/data-lakes/), be sure to consider the EMR and AWS IAM components listed below.
 
@@ -64,18 +54,6 @@ Data Lakes uses an IAM role to grant Segment secure access to your AWS account.
 - **external_ids**: External IDs are the part of the IAM role which Segment uses to assume the role providing access to your AWS account. You will define the external ID in the IAM role as the Segment Workspace ID in which you want to connect to Data Lakes. The Segment Workspace ID can be retrieved from the [Segment app](https://app.segment.com/goto-my-workspace/overview){:target="_blank"} by navigating to **Settings > General Settings > ID**.
 - **s3_bucket**: Name of the S3 bucket used by the Data Lake.
 
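Because the IAM role and external ID are what Segment uses to reach your AWS account, it can help to confirm them from your side before connecting the destination. The sketch below is illustrative only, with placeholder role and bucket names; the trust policy returned by `get-role` should list your Segment workspace ID under the `sts:ExternalId` condition:

```bash
# Sketch only — role name and bucket name are placeholders.
aws iam get-role \
  --role-name segment-data-lakes-role \
  --query 'Role.AssumeRolePolicyDocument'

aws s3api head-bucket --bucket my-segment-data-lake-bucket
```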
-### Set up Segment Data Lakes (Azure)
-
-To connect Segment Data Lakes (Azure), you must set up the following components in your Azure environment:
-
-- [Azure Storage Account](/docs/connections/storage/catalog/data-lakes/#step-1---create-an-alds-enabled-storage-account): An Azure storage account contains all of your Azure Storage data objects, including blobs, file shares, queues, tables, and disks.
-- [Azure KeyVault Instance](/docs/connections/storage/catalog/data-lakes/#step-2---set-up-key-vault): Azure KeyVault provides a secure store for your keys, secrets, and certificates.
-- [Azure MySQL Database](/docs/connections/storage/catalog/data-lakes/#step-3---set-up-azure-mysql-database): The MySQL database is a relational database service based on the MySQL Community Edition, versions 5.6, 5.7, and 8.0.
-- [Databricks Instance](/docs/connections/storage/catalog/data-lakes/#step-4---set-up-databricks): Azure Databricks is a data analytics cluster that offers multiple environments (Databricks SQL, Databricks Data Science and Engineering, and Databricks Machine Learning) for you to develop data-intensive applications.
-- [Databricks Cluster](/docs/connections/storage/catalog/data-lakes/#step-6---configure-databricks-cluster): The Databricks cluster is a cluster of computation resources that you can use to run data science and analytics workloads.
-- [Service Principal](/docs/connections/storage/catalog/data-lakes/#step-5---set-up-a-service-principal): Service principals are identities used to access specific resources.
-
-For more information about configuring Segment Data Lakes (Azure), see the [Data Lakes setup page](/docs/connections/storage/catalog/data-lakes/#set-up-segment-data-lakes-azure).
 
 ## Data Lakes schema
 
@@ -128,15 +106,6 @@ The schema inferred by Segment is stored in a Glue database within Glue Data Cat
 
 > info ""
 > The recommended IAM role permissions grant Segment access to create the Glue databases on your behalf. If you do not grant Segment these permissions, you must manually create the Glue databases for Segment to write to.
 
-### Segment Data Lakes (Azure) schema
-
-Segment Data Lakes (Azure) applies a consistent schema to make raw data accessible for queries. A transformer automatically calculates the desired schema and uploads a schema JSON file for each event type to your Azure Data Lake Storage (ADLS) in the `/staging/` directory.
-
-Segment partitions the data in ALDS by the Segment source, event type, then the day and hour an event was received by Segment, to ensure that the data is actionable and accessible.
-
-The file path looks like this:
-`//staging//`
-
 ### Data types
 
 Data Lakes infers the data type for an event it receives. Groups of events are polled every hour to infer the data type for that each event.
@@ -181,7 +150,7 @@ Segment doesn't support User deletions in Data Lakes, but supports [user suppres
 
 ### How does Data Lakes handle schema evolution?
 
-As the data schema evolves, both Segment Data Lakes (AWS) and Segment Data Lakes (Azure) can detect new columns and add them to Glue Data Catalog or Azure Data Lake Storage (ADLS). However, Segment can't update existing data types. To update Segment-created data types, please reach out to [AWS Support](https://aws.amazon.com/contact-us/){:target="_blank"} or [Azure Support](https://support.microsoft.com/en-us/topic/contact-microsoft-azure-support-2315e669-8b1f-493b-5fb1-d88a8736ffe4){:target="_blank"}.
+As the data schema evolves, Segment Data Lakes (AWS) can detect new columns and add them to Glue Data Catalog. However, Segment can't update existing data types. To update Segment-created data types, please reach out to [AWS Support](https://aws.amazon.com/contact-us/){:target="_blank"}.
 
 ### How does Data Lakes work with Protocols?
 
@@ -206,9 +175,6 @@ Data Lakes offers 12 syncs in a 24 hour period and doesn't offer a custom sync s
 
 ### What is the cost to use AWS Glue?
 You can find details on Amazon's [pricing for Glue](https://aws.amazon.com/glue/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
 
-### What is the cost to use Microsoft Azure?
-You can find details on Microsoft's [pricing for Azure](https://azure.microsoft.com/en-us/pricing/){:target="_blank"} page. For reference, Data Lakes creates 1 table per event type in your source, and adds 1 partition per hour to the event table.
-
 ### What limits does AWS Glue have?
 AWS Glue has limits across various factors, such as number of databases per account, tables per account, and so on. See the [full list of Glue limits](https://docs.aws.amazon.com/general/latest/gr/glue.html#limits_glue){:target="_blank"} for more information.
 
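Since Data Lakes creates one Glue table per event type and adds hourly partitions, it can be useful to inspect the Glue Data Catalog directly when you're checking against these limits. A minimal sketch with the AWS CLI; the database name is a placeholder for the Glue database Segment writes to:

```bash
# Sketch only — the Glue database name is a placeholder.
aws glue get-databases --query 'DatabaseList[].Name'

aws glue get-tables \
  --database-name segment_source_database \
  --query 'TableList[].{table:Name,partitionKeys:PartitionKeys[].Name}'
```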
@@ -222,9 +188,3 @@ Segment stops creating new tables for the events after you exceed this limit. Ho
 
 You should also read the [additional considerations in Amazon's documentation](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html){:target="_blank"} when using AWS Glue Data Catalog.
 
-### What analytics tools are available to use with Segment Data Lakes (Azure)?
-Segment Data Lakes (Azure) supports the following analytics tools:
- - PowerBI
- - Azure HDInsight
- - Azure Synapse Analytics
- - Databricks