Azure Data Factory and Virtual Networks – Choosing the right integration runtime

Azure Data Factory is Microsoft’s cloud service for creating integration pipelines between different data sources and orchestrating these pipelines to create fully automated integration solutions. Data Factory provides capabilities for simply copying data from one place to another, performing transformations on the data and even employing custom functionality from a wide array of different separate services. These services include Databricks, Azure Functions, HDInsight, and much, much more. As such I think it’s fair to call Data Factory “the Azure ETL service.”

Note! For a listing of all supported data sources and built-in activity types in Data Factory, check out the official documentation!

Sometimes, however, you might want to connect your Data Factory to a data source that is being blocked by a firewall: Maybe it’s a database server located in your on-premises network, or perhaps it’s your Azure Data Lake service that has been configured to block access from all Azure services unless they are specifically permitted. In cases like these you need to integrate your Data Factory with a virtual network – or a VNET – that provides network access to the target data source.

VNET integrations can be slightly complicated due to Data Factory’s underlying architecture: The Data Factory resource that you see in Azure Portal is just the orchestration and design layer of the service, and in order to perform its actual integration and ETL logic Data Factory employs separate environments called integration runtimes. There are several types of integration runtimes, each with its own set of capabilities and costs associated. The default Data Factory runtime, the Azure integration runtime, is not associated with a VNET and as such it cannot be used to connect to data sources that are secured with the most restrictive firewalls. Instead, there are three different kinds of integration runtimes that you can create which do allow you to integrate with virtual networks. What are these and when should you choose each one? Let’s find out!

Note! Azure has a concept of “Trusted services”, which includes a list of services that can by default bypass the firewalls of several other Azure products, such as Azure Storage. Data Factory is one of these trusted services, and as such you can access a storage account with default configurations from Data Factory without a VNET. However, if your storage account contains very sensitive data you might want to disable access for these trusted services as part of your security hardening process, and only permit access to the data via VNET!

Three choices for integrating with VNETs

Azure integration runtime with Managed VNET

Note! At the time of writing, Azure integration runtimes with managed VNETs are in public preview, and as such there is no SLA guarantee from Microsoft for using them.

In the introduction I mentioned that the default integration runtime that Data Factory creates for you is not associated with any virtual networks. Well, with Data Factory you can actually create an Azure integration runtime that’s just like the default one – except that it’s associated with a managed VNET. At this point you might be asking two questions: First, isn’t that all then, why not just use the thing that’s the same as the default option except with all the shortcomings fixed? And second, wait, what’s a managed VNET?

For the first question, for the most part you will probably be just fine with the Azure integration runtime with managed VNETs. Since they are just like a regular non-VNET Azure runtime, except for the whole VNET thing, they will be a very convenient choice for most Data Factory implementations. Besides, they are the only VNET-enabled runtime that allows you to use Data Flows for transforming data.

There are some downsides to them, too: First, you can currently only associate the runtimes with managed VNETs in the same Azure region where your Data Factory resides. This can be a limitation if you are working on large-scale integration solutions that span multiple Azure regions. Second, Azure integration runtimes are only able to connect to data sources that Data Factory itself supports. This isn’t any different from an integration runtime without a VNET, but the other two VNET-enabled options support even more data sources than what’s available out of the box.

Last, integration runtimes with managed VNETs take some time to start up, whereas normal Azure runtimes are available almost instantly. This is something you need to keep in mind when both scheduling your pipelines and debugging them. They also incur more usage costs, and the pricing terms and Azure pricing calculator don’t make it very easy to predict your managed VNET costs in advance. Regardless, managed VNETs are likely to be the most affordable of the three VNET-enabled runtime options.

As for managed virtual networks, they are a special brand of VNETs that Azure creates and manages for you. Normally what happens with VNETs is that you – or someone else in your organization – create and configure them for yourself and then you add Azure resources into the VNET. With managed VNETs Data Factory does all of that for you by creating a VNET of its own with managed Private Endpoint resources which then get associated with all of the data sources that you want your integration runtime to connect to. In other words, if your solution architecture contains only three Azure services: A Data Factory, a storage account and a SQL database, you could copy data from the storage account to the SQL database via a VNET without even creating one yourself, because the Data Factory can do that for you.

For simple scenarios, like the one I described above, managed VNETs are fantastic. When working with them, however, it’s important to note that since the VNETs are entirely handled by Data Factory, you have very little control over them. As such, if you want to use managed VNETs to connect to on-premises networks or other, separate VNETs in your Azure subscription, you will have to get creative with Azure Private Link to connect the managed VNET to those other networks. Fortunately, Microsoft has provided an example tutorial just for such occasions to serve as a starting point.

Azure SSIS integration runtime

An Azure SSIS integration runtime is a runtime that is capable of running SQL Server Integration Services, or SSIS, packages in the cloud. Essentially the Azure SSIS runtime is a virtual machine with SSIS installed on it, to which you can publish SSIS packages and then execute them. In the context of Data Factory, Azure SSIS does not utilize any of Data Factory’s own capabilities besides orchestrating package execution and runtime management. Instead, all of the data integration logic is implemented using standard SSIS functionality. As such, SSIS can easily seem like a runtime option that’s in Data Factory just for backwards compatibility, since it enables you to lift and shift existing SSIS implementations from on-premises servers to the cloud.

However, that is not the entire truth, since Azure SSIS does offer several benefits: From a purely functional standpoint, with Azure SSIS you can enable connections to data sources that Data Factory itself does not support. For example, if you need to connect to a database that only supports connections via proprietary ODBC drivers, you can customize your SSIS virtual machine to install these drivers on start-up, and then use Azure SSIS to connect to that database. Additionally, if your team has people with strong SSIS skills, then choosing Azure SSIS lets you make use of those skills in your cloud-based ETL projects right away. Even though learning to use Data Factory isn’t difficult for experienced data engineers, sometimes even that small speedbump can make the difference between meeting and missing a crucial deadline.

In terms of virtual networks, when you are configuring an Azure SSIS integration runtime in Data Factory you can associate it with a specific VNET. After that, any SSIS packages run on the runtime will have connectivity to all Azure and on-premises resources that are accessible via that network. In this sense Azure SSIS gives you full control over the network resources included in your integration solution. It’s worth noting, however, that whereas booting up an SSIS runtime that is not associated with a VNET takes only a few minutes, a VNET-integrated runtime will take about 30 minutes to start, which is something you need to take into consideration if you want to automate your runtime management.
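To make the start-up time concrete, here is a minimal Python sketch of working out when an automated start needs to fire so that the runtime is ready before a scheduled package run. The 30-minute figure comes from the paragraph above. The 5-minute figure for a runtime without a VNET, the 10-minute safety margin and the function name are assumptions made up for illustration, not anything Data Factory provides.

```python
from datetime import datetime, timedelta

# Approximate start-up times. The VNET figure is the one quoted above;
# the non-VNET figure is an assumed stand-in for "a few minutes".
STARTUP_WITH_VNET = timedelta(minutes=30)
STARTUP_WITHOUT_VNET = timedelta(minutes=5)


def latest_runtime_start(package_run_at: datetime,
                         vnet_joined: bool,
                         safety_margin: timedelta = timedelta(minutes=10)) -> datetime:
    """Return the latest time at which the runtime start should be triggered."""
    startup = STARTUP_WITH_VNET if vnet_joined else STARTUP_WITHOUT_VNET
    return package_run_at - startup - safety_margin


nightly_run = datetime(2021, 1, 15, 2, 0)  # hypothetical 02:00 package run
print(latest_runtime_start(nightly_run, vnet_joined=True))   # 2021-01-15 01:20:00
print(latest_runtime_start(nightly_run, vnet_joined=False))  # 2021-01-15 01:45:00
```

In practice you would measure the real start-up time of your own runtime and adjust the constants accordingly.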

Note! One downside with Azure SSIS integration runtimes is that they are quite expensive: you pay not only for the runtime’s virtual machine itself, but also for the SQL Server license needed to run SSIS packages. Fortunately, you can save on these costs by automating your runtime start-up and shutdown!
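As a rough illustration of why automating start-up and shutdown matters, the following Python sketch compares an always-on runtime with one that is started only for a daily run. The hourly rate is a hypothetical placeholder rather than a real Azure price, and treating the start-up period as billed time is a conservative assumption on my part, so check the pricing terms for your own setup.

```python
def monthly_runtime_hours(run_hours_per_day: float,
                          startup_hours: float = 0.5,
                          days: int = 30) -> float:
    """Hours per month the runtime is up when started only for each daily run."""
    return (run_hours_per_day + startup_hours) * days


def monthly_cost(hours: float, hourly_rate: float) -> float:
    """Cost of keeping the runtime up for the given number of hours."""
    return hours * hourly_rate


HOURLY_RATE = 1.0  # hypothetical placeholder, NOT a real Azure price

always_on = monthly_cost(24 * 30, HOURLY_RATE)
scheduled = monthly_cost(monthly_runtime_hours(run_hours_per_day=2), HOURLY_RATE)

print(always_on, scheduled)                        # 720.0 75.0
print(f"saving: {1 - scheduled / always_on:.0%}")  # saving: 90%
```

The ratio is what matters here, not the absolute numbers: a runtime that only needs to be up for a couple of hours per day spends most of an always-on month idle.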

Self-hosted integration runtime

Self-hosted integration runtimes are, in terms of available capabilities, the simplest of Data Factory runtimes. They are machines that you manage yourself, either as virtual machines on Azure or as servers in your on-premises network. To utilize a self-hosted integration runtime, you need to install the runtime software on the desired server and associate it with your Data Factory instance. Since you are responsible both for the machine used by the runtime – virtual or otherwise – and the network configurations, self-hosted integration runtimes provide you with the most control over how your IT resources are utilized. They also enable you to make use of physical on-premises servers, in case you happen to have some underutilized hardware lying around. Another benefit to using self-hosted integration runtimes is that, just like with Azure SSIS, you can install your own data source drivers on the runtime, enabling connectivity to data sources that Data Factory normally couldn’t utilize.

There’s a downside to self-hosted integration runtimes, however: The only Data Factory-based activity available on them is the copy activity, which copies data from one source to another as-is. If you want to perform any transformations on the data, you can’t do those using Data Factory’s Data Flows. Instead, you will have to use Databricks, HDInsight or some other external service. This can sometimes limit the usefulness of self-hosted integration runtimes.

Note! There’s also an alternative approach to transforming data with self-hosted integration runtimes, which is to use the self-hosted runtime to move data to a staging storage in the cloud, and then use another integration runtime for running transformation operations. Technically these kinds of scenarios are simple to build, but they come with separate design concerns of their own. For example, should the staging storage be protected with a VNET as well? If so, are you using Data Flows with an Azure integration runtime or SSIS for the transformations? Is the monetary cost of running multiple VNET-enabled runtimes acceptable for you? And so on.
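The trade-offs between the three runtimes described above can be summarized as a small capability table. The Python sketch below is only my condensed reading of this post, with field names invented for illustration. It is not an official feature matrix and it leaves out costs and start-up times.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Runtime:
    name: str
    data_flows: bool             # can run Data Factory Data Flows
    custom_drivers: bool         # lets you install your own data source drivers
    own_vnet: bool               # joins a VNET that you control yourself
    transforms_in_runtime: bool  # transformations without an external service


# Values condensed from the three runtime descriptions in this post.
RUNTIMES = [
    Runtime("Azure IR with managed VNET", True, False, False, True),
    Runtime("Azure SSIS IR", False, True, True, True),
    Runtime("Self-hosted IR", False, True, True, False),
]


def candidates(**required: bool) -> List[str]:
    """Names of the runtimes that have every capability passed as True."""
    return [r.name for r in RUNTIMES
            if all(getattr(r, cap) for cap, needed in required.items() if needed)]


print(candidates(custom_drivers=True))
# ['Azure SSIS IR', 'Self-hosted IR']
print(candidates(custom_drivers=True, transforms_in_runtime=True))
# ['Azure SSIS IR']
print(candidates(data_flows=True))
# ['Azure IR with managed VNET']
```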

How to choose the right VNET runtime for me

You might have gotten a pretty good idea by now of how to go about choosing your VNET-enabled Data Factory runtime. While there isn’t a simple one-size-fits-all answer to architectural questions like this, I like to approach the question from a perspective of keeping things simple: If the Azure integration runtime with managed VNET, which is the simplest option of the three, fits all of the technical and business requirements that your integration solution has, then that’s what I would choose.

At least I would, if it wasn’t still in preview. It would be irresponsible of me to support using a service that’s not recommended for production usage by the service’s own vendor ( 😉 ), but once it’s in GA, that’s the number one choice!

So how about the other two options? The most obvious thing to look at is whether you can actually connect to your data sources with Data Factory’s own connectors, or whether you need to install custom database drivers. If you are technically unable to implement your solution with an Azure integration runtime, then you have to look at either an Azure SSIS or a self-hosted integration runtime. Another point of concern is your Azure network architecture: While it’s unlikely for managed VNETs to be an issue for anyone, just in case you do require more control over your network resources, you will again have to choose between SSIS and self-hosted.

With that the task of choosing a runtime becomes as follows: “Can I use an Azure integration runtime with managed VNETs to implement my solution? If yes, good, if not, then which of the other two do I choose?” The selection between SSIS and self-hosted will depend more on the nature of your solution. If you are implementing a full-fledged ETL (or ELT) solution, I would generally recommend SSIS since it enables you to do pretty much everything by itself – but I will admit that this is also partly just personal preference and being more familiar with SSIS on my part. Self-hosted solutions only support copying data, and with them you need to perform transformation tasks elsewhere.

Of course, this isn’t entirely black and white either, since with some solutions these external services can actually be a lot more performant (e.g., using Databricks in a big data solution to transform terabytes of data). You might also want to consider the fact that, yes, SSIS is a bit of a legacy product by now. I don’t see it going away anytime soon, but do you want to develop a new SSIS solution with an expected lifetime of 10-15 years in the 2020s? That’s a tough question. In the end it all comes down to considering the specific requirements for your data, including but not limited to the amount of data you need to transfer, the kinds of transformations you need to implement and the frequency of your runs, and then finding the most appropriate solution for your own unique case.
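The order of questions in this section can be sketched as a small Python function. This is a simplification under assumptions: the parameter names are my own, the preview flag reflects the state of managed VNETs at the time of writing, and the final choice between SSIS and self-hosted is reduced to a single question even though, as noted above, it is rarely that clear-cut.

```python
def choose_vnet_runtime(needs_custom_drivers: bool,
                        needs_own_network_control: bool,
                        transforms_in_external_service: bool,
                        preview_acceptable: bool = False) -> str:
    """Pick a VNET-enabled runtime following the order of questions in this post."""
    # Question 1: does the simplest option meet every requirement?
    managed_vnet_fits = not needs_custom_drivers and not needs_own_network_control
    if managed_vnet_fits and preview_acceptable:
        return "Azure integration runtime with managed VNET"

    # Question 2: where do the transformations run?
    if transforms_in_external_service:
        # Copy with the runtime, transform in e.g. Databricks or HDInsight.
        return "Self-hosted integration runtime"
    return "Azure SSIS integration runtime"


print(choose_vnet_runtime(False, False, False, preview_acceptable=True))
# Azure integration runtime with managed VNET
print(choose_vnet_runtime(True, False, False))
# Azure SSIS integration runtime
print(choose_vnet_runtime(True, False, True))
# Self-hosted integration runtime
```

Treat the result as a starting point for the discussion rather than a verdict, since data volumes, run frequency and team skills all weigh in as well.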

In closing

There you have it, a brief overview of different options for integrating your Azure Data Factory with virtual networks. Hopefully you now have a better idea of how to go about choosing the right runtimes for your solution architecture – as always with data, it all depends! And if you have any questions or comments, or feel like you’d rather have a second opinion on your Data Factory architecture, don’t be a stranger!
