Build a Logical Folder Structure – The Storage of Data


No single folder structure fits every solution; what is considered a logically designed folder structure is relative. You have seen the following folder structures:

  • EMEA/brainjammer/in/2022/04/01/18
  • EMEA/brainjammer/out/2022/04/01/19
  • EMEA/brainjammer/raw‐files/2022/04/08/18
  • EMEA/brainjammer/cleansed‐data/2022/04/08
  • EMEA/brainjammer/business‐data/2022/04

Those folders are well suited for ingesting data. The structure supports either allowing systems to send the data to those directories or writing a process to retrieve and store the data there. You can also see the data landing zone (DLZ) pattern, which progresses along the Big Data pipeline data transformation processing stages. Another example of a logical folder structure is the way the brainjammer brainwave files are organized:

  • brainjammer/SessionCSV/ClassicalMusic/EEG
  • brainjammer/SessionCSV/MetalMusic/POW
  • brainjammer/SessionCSV/Meditation/EEG
  • brainjammer/SessionJSON/TikTok/POW
  • brainjammer/SessionJSON/PlayingGuitar/POW

The ultimate objective of a folder structure is to organize your files. They need to be arranged in a way that anyone looking for a specific kind of data can find it easily. The directory path name and even file names also can be very helpful for discovering data. The brainjammer brain waves provide that intuitive understanding from the structure and type of data stored within those folders. Folder structures can exist in an ADLS container, on Azure Files, on an FTP site, on your workstation, or via a network SMB share.

Build External Tables

An external table is one that targets data located, for example, on ADLS or an Azure Blob Storage container. Once you create the table from the data stored in those containers, you can perform T‐SQL–like queries against the data. This is a powerful feature. You can make a few configurations and then perform SQL queries against a file without moving it from your data lake. This is the magic of PolyBase, which is working behind the scenes to make this happen. The approach to create an external table depends on which type of SQL pool you are targeting. For a serverless SQL pool, you would create an external table using the CREATE EXTERNAL TABLE AS SELECT (CETAS) statement. When targeting a dedicated SQL pool, you would use the CREATE TABLE AS SELECT (CTAS) command. CETAS is also supported (more about this later). Recall the following example from Chapter 2, “CREATE DATABASE dbName; GO,” noting the value for TYPE:

CREATE EXTERNAL DATA SOURCE Meditation_Source
WITH (LOCATION = ‘abfss://<uid>@<accountName>.dfs.core.windows.net’,
      TYPE = HADOOP);

A Hadoop table type is only available for dedicated SQL pools. The table type used for serverless SQL pools is referred to as native and is considered to be more performant versus dedicated. Therefore, it is currently recommended to use CTAS for dedicated pools until the native type is “completely” supported by dedicated. “Completely” is in quotes because at the time of writing, the native table type is partially supported for dedicated SQL pools—but only for Parquet files, and that is in preview. So, you can use CETAS for dedicated SQL pools, but because it is not currently supported while in preview, the only option you have when targeting a dedicated SQL pool in production is to use the Hadoop table type. Complete Exercise 4.11, where you will build an external table on a serverless SQL pool in Azure Synapse Analytics.

Leave a Reply

Your email address will not be published. Required fields are marked *