Salesforce Data Integration Primer - ETL and SSIS

If you are working with Salesforce, especially if you are building out your first instance, you will likely need to populate Salesforce with at least some of your existing data. That could come from a legacy CRM or from other business systems you wish to enrich your Salesforce data with. Once your Salesforce instance is loaded and launched to production, you will likely have ongoing integration needs: sending data to Salesforce from on-prem databases and loading those same databases from Salesforce. While Salesforce does offer some decent reporting capabilities, it is likely you'll want to bring your Salesforce data down to your ODS or data warehouse, where you can join it with the rest of your business data to gain the most meaningful insights.

While there are other ways you can integrate data to and from Salesforce, we are focusing on the Business Intelligence wheelhouse. For example, if you wanted to cascade data in real time from other source systems to Salesforce, you could build that out with custom development. In many cases, though, you won't need that kind of custom solution and a scheduled, incremental data load will suffice.

Some Considerations

Here are some considerations you will likely wrestle with as you get started, along with some good-to-know facts before you tackle your first Salesforce integration project. This is a wide-angle view--we will go into more detail later in this guide.

  • The different types of integration tools we can use to load and extract data from Salesforce.

  • Data integration is performed by accessing Salesforce APIs. Salesforce offers both standard (REST/SOAP) and bulk APIs. Use the bulk API to load large amounts of data without consuming too many API calls. Use the standard API to process small batches and get row-level error reporting.

  • You have a limited number of API calls you can consume each day, and that limit depends on how many users you have licensed in Salesforce.

  • There is a canned report in Salesforce that will tell you how many API calls each user has consumed each day for the last seven days. This is helpful for tracking how many calls your data integration process is consuming.

  • When updating data in Salesforce you must use the Salesforce Internal Id (Id). If you want to upsert (insert/update) data to a table in Salesforce you need to create an External Id that is unique in the target table.

  • The lookup data type in Salesforce can only reference another record by its Salesforce Internal Id (Id). You cannot use an External Id or any other field.

  • Get familiar with the way Person is modeled in Salesforce. You can have the Individual model or the Person Account model. How Person is modeled will affect how you relate data in Salesforce.

  • In the person account model, contact records are stored in the account table as well as the contact table. In order to relate account contacts to their parent account in the account table, you need to use the AccountContactRelation table in Salesforce.

  • Get familiar with the Setup section in the Salesforce application; this is where you can learn a lot about your environment. It includes things like the object model in a nice customizable visualization, bulk data load tracking and history, error logs, and feature settings. Make sure you have access to Setup in Salesforce.

Integration Tools

As you start down the road of data integration with Salesforce, it's good to know what tools you have available to you. Below is an overview of some of the more common tooling for data integration with Salesforce.

Data Loader - The first and most common data integration tool you will hear about when getting started with Salesforce is the Salesforce Data Loader. This is a free tool Salesforce provides that lets users load files into Salesforce and export Salesforce data to a file. It is geared toward manual, ad-hoc loads rather than the kind of scheduled integration work we are covering here, but I would install it anyway and get familiar with it. There is a good chance that somebody in your organization will use Data Loader at some point, so it's good to know what it can do.

SSIS Integration - Out of the box, SQL Server Integration Services (SSIS) does not support connectivity to Salesforce, but a number of companies provide add-ins that give you Salesforce connectivity inside of SSIS. Most add-in packages for Salesforce include Salesforce data connectors plus Source, Destination, and Lookup components, which cover all the integration points you will need for your project. Products we have tested and like are SentryOne SSIS Task Factory, CozyRoc SSIS, Devart Data Flow Components, and ZappySys SSIS Components. These tools are relatively cheap, $300 - $2000 for a server license. The client tools are free, though, so you can test any of these on your workstation. Of the add-ins mentioned above we liked the SentryOne SSIS Task Factory best.

Because of its ubiquity, community support, free license with SQL Server, and ease of use, SSIS can be a great option. Coupled with some cheap Salesforce add-ins, the cost of integration tooling barely registers a blip on the technology budget radar. If you don't already have Microsoft SQL Server in your organization, though, you might find better options out there.

Apex - If you don't have a data integration or BI developer on staff, you can always turn to your app devs for your integration needs. Apex is an object-oriented programming language used exclusively for Salesforce. Apex allows you to execute flow and transaction control statements in Salesforce in conjunction with calls to the API. There are some things, like complex workflows, that might require custom development with your app dev team, but day-to-day integration tasks are probably better kept with your BI and data teams.

iPaaS - Integration Platform as a Service (iPaaS) is new to a lot of organizations, but if you already make use of one--Boomi, Mulesoft, Jitterbit, et al.--then you already know what it can do. Without getting too deep into what iPaaS is and what it can do for you, I will say that it can be a great tool for building data integration workflows with Salesforce, as well as building complex data integration workflows across multiple disparate systems. Out of the box, iPaaS tools have everything you need to integrate with Salesforce. iPaaS isn't cheap, though, so if your only business use case is Salesforce integration there are probably better options out there.

Skyvia - As a standalone integration tool, Skyvia can help you with all integrations related to Salesforce. Skyvia is produced by Devart, the same company that produces the Devart Data Flow Components for SSIS mentioned above. I don't have a lot of experience with this tool so I don't have much to say about it. The company is based out of Ukraine, so reaching them during business hours in the West can be a challenge. Aside from that, the feedback on this product is pretty good and it is worth a look if it makes sense for your environment. Pricing is based on the number of records processed and starts at $4800 for 10 million records processed per month.

Capstorm - Capstorm is a Salesforce data backup and recovery tool that can also be used to replicate data in and out of Salesforce. I don't think you would use this in most common data integration scenarios, but it does have its uses. It is worth a look if you also plan on implementing a robust backup solution--it is a leader in that space.

Now that we know what Salesforce integration options are available to us, let's move on to some of the more nuanced discussions around getting data in and out of Salesforce.

Data Integration and Salesforce APIs

The way you integrate data with Salesforce is through one of the Salesforce APIs. You will use either the standard API (REST or SOAP) or the bulk API, and it's important you understand the difference between them. It doesn't matter what integration tooling you use; you still connect to the same APIs and you still have the same constraints for each of them. When planning common, schedule-based data integration processes you need to choose how you will connect to Salesforce: with the standard API or the bulk API.

The reason this is so important is the way Salesforce licenses API calls to its system. For each user license you have in Salesforce you are allowed 1,000 API calls per day, and that total includes the API calls those users are making through the Salesforce application. If you are pushing the limits of your daily API cap, the standard API might not be the way to go for your organization. The standard APIs are restricted to much smaller batch sizes than the bulk API, resulting in higher API call counts per data integration operation. The bulk API allows for much larger batch sizes, 5,000+ records per batch. This is an important consideration when you are getting into data integration with Salesforce, and those API counts are not the only thing you need to be concerned with when choosing between the APIs.
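
If you want to see where you stand against that daily cap without waiting on the canned report, the REST API exposes a limits resource you can call directly. Below is a minimal PowerShell sketch; the instance URL, API version, and access token are placeholders, and in practice your integration tooling will have handled authentication for you.

# Placeholders - use your own instance URL and a valid OAuth access token.
$instanceUrl = 'https://yourInstance.my.salesforce.com'
$headers     = @{ Authorization = 'Bearer <your OAuth access token>' }

# The limits resource reports the org's allowances, including the daily API cap.
$limits = Invoke-RestMethod -Uri "$instanceUrl/services/data/v52.0/limits" -Headers $headers

# DailyApiRequests shows the daily maximum and how many calls you have left today.
$limits.DailyApiRequests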

There is another key difference between these APIs that you need to be aware of. When working with the standard API you get row-by-row feedback for your data integration operations, so if a record fails and the job fails with it, you get that feedback right away, whether that is in Data Loader, SSIS, Apex, or iPaaS. The bulk API is different: instead of going row by row, it produces a large file per batch and submits that to Salesforce for processing. That means that once you have submitted a batch through the bulk API, failures will not be reported back to the application you submitted them with. In Salesforce there is a section where you can see the status of your bulk API jobs: Setup>Platform Tools>Environments>Jobs>Bulk Data Load Jobs. This is important to note because data loads using the bulk API can feel like a black box unless you know where to look for the status of the load you submitted. This is also where you can find errors related to your data loads so you know what needs to be corrected.
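
If your tooling loads through Bulk API 2.0, you can also poll that same job status programmatically rather than through Setup. Here is a hedged sketch--the job Id is whatever your load returned when the job was created, and the URL, API version, and token are placeholders:

$instanceUrl = 'https://yourInstance.my.salesforce.com'
$headers     = @{ Authorization = 'Bearer <your OAuth access token>' }
$jobId       = '<the bulk job Id returned when the job was created>'

# Job state and record counts (processed vs. failed) for the bulk load.
Invoke-RestMethod -Uri "$instanceUrl/services/data/v52.0/jobs/ingest/$jobId" -Headers $headers

# The failed rows themselves, with Salesforce's error message for each record.
Invoke-RestMethod -Uri "$instanceUrl/services/data/v52.0/jobs/ingest/$jobId/failedResults" -Headers $headers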

It often makes sense to use the bulk API for large loads and the standard API for everything else. If you can build a good incremental workflow, that should set you up to use the standard API most of the time; it might just be the initial loads that need the bulk API. Your mileage may vary, but these are things you want to have on your radar. Also, don't forget to track down the canned Salesforce report on API usage, which will show you API usage by user by day for the last seven days.

DML Operations in Salesforce

When it comes time to actually update the data in Salesforce there are several rules you need to follow. There are different DML operations you can use in Salesforce, such as insert, update, delete, upsert, and merge. One thing to note about merge: it is not the same as MERGE in SQL Server, so if you are looking to update/insert in the same operation, use the upsert option in Salesforce. There are some other nuances about the different DML types you should know.

If you are updating data in Salesforce you need to do that on a Salesforce Internal Id. A Salesforce Internal Id is what Salesforce uses to uniquely identify a row in a table in its database--think of it like an auto-generated identity column; each time a row is created, a new Salesforce Internal Id is created to represent uniqueness in the table. The only way you can update (not upsert) a Salesforce table is by using this Salesforce Internal Id; you cannot update data without it. You will probably end up with some kind of workflow that brings Salesforce data into your local environment, uses an External Id to join to your other source data, and then uses the Salesforce Internal Id you brought down to do the update. Or…

If you need to upsert (update/insert) data you can't do this on the Salesforce Internal Id; for this you will need a Salesforce External Id. A Salesforce External Id is a way for your business to uniquely identify a row in a Salesforce table using a business key, a value that has meaning to the business. You will select or create a field in the desired table in Salesforce and designate that field an External Id, and while it is not strictly required, I would also recommend marking the value as unique, because an upsert will fail if the External Id value matches more than one row. Once you have the unique External Id configured for your target table you can start upserting to the table.
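
To make the difference concrete, here is a minimal PowerShell sketch of both operations against the REST API. The instance URL, token, Ids, and the External Id field name (Legacy_Account_Number__c) are placeholders--your SSIS add-in or other tooling issues equivalent calls under the covers.

$instanceUrl = 'https://yourInstance.my.salesforce.com'
$headers     = @{ Authorization  = 'Bearer <your OAuth access token>'
                  'Content-Type' = 'application/json' }
$body        = @{ Name = 'Acme Holdings'; Phone = '555-0100' } | ConvertTo-Json

# Update: addressed by the Salesforce Internal Id; a 204 response means success.
$accountId = '001XXXXXXXXXXXXXXX'
Invoke-RestMethod -Method Patch -Headers $headers -Body $body `
    -Uri "$instanceUrl/services/data/v52.0/sobjects/Account/$accountId"

# Upsert: addressed by an External Id field and value; Salesforce inserts when there is
# no match (HTTP 201) and updates when exactly one row matches (HTTP 204).
Invoke-RestMethod -Method Patch -Headers $headers -Body $body `
    -Uri "$instanceUrl/services/data/v52.0/sobjects/Account/Legacy_Account_Number__c/LEG-00042"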

Establishing Data Relations in Salesforce

Loading data is only the beginning; you need to make sure all the data you are loading relates properly across the Salesforce instance. One of the areas you might get hung up on early is the Person model in Salesforce. Salesforce has different ways you can model Person: the Individual model and the Person Account model. I would encourage you to go out and read more about this--Salesforce has some good documentation around it. What I would like to talk about here is the Person Account model, how it works, and how you build relationships with it.

In the Person Account model, contacts are stored in both the Account table and the Contact table. When you load an account contact to the Account table, a contact record is also created automatically in the Contact table in Salesforce. The Account table can hold many different types of accounts, so let's assume we are working with two types for this example: an Account Household and an Account Household Contact (members of the household). Once we load both the Account Household and the Household Contact, we need to tell Salesforce they are related, and to do that we use the AccountContactRelation table in Salesforce. The AccountContactRelation table is a bridge table/junction object in Salesforce that connects a contact record to an account record using the Salesforce Internal Ids of both the account and the contact. This is the primary way you will relate contacts and accounts in the Person Account model.
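
For illustration, here is a minimal sketch of creating that relationship row over the REST API, assuming you already have the Internal Ids of the household account and the contact (all values below are placeholders; your integration tooling does the equivalent under the covers):

$instanceUrl = 'https://yourInstance.my.salesforce.com'
$headers     = @{ Authorization  = 'Bearer <your OAuth access token>'
                  'Content-Type' = 'application/json' }

$body = @{
    AccountId = '001XXXXXXXXXXXXXXX'   # Internal Id of the Account Household
    ContactId = '003XXXXXXXXXXXXXXX'   # Internal Id of the Household Contact's contact record
} | ConvertTo-Json

# Creating the AccountContactRelation row is what ties the contact to the account.
Invoke-RestMethod -Method Post -Headers $headers -Body $body `
    -Uri "$instanceUrl/services/data/v52.0/sobjects/AccountContactRelation"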

Another way you can relate data in Salesforce is to create or make use of lookup fields. A lookup field is a data type in Salesforce that holds a Salesforce Internal Id pointing at the related record. Lookup fields are only compatible with Salesforce Internal Ids; they will not work with an External Id or other field types. Imagine you have an Account type of Registered Investment Advisor (RIA) and you have Advisors who work for this RIA. If you want to relate the Advisor to the RIA, you might have a lookup field named Advisor in the Account table. That field stores the Salesforce Internal Id of the Advisor, which surfaces the Advisor record on the RIA record and builds a kind of relationship between the two record types.

You can also build your own junction objects for many-to-many relationships, but for one-to-many relationships you can use lookup fields.

Setup and Troubleshooting

Inevitably you are going to run into issues during your Salesforce integration project, so you need to know where to look. The best resource is the Setup interface in Salesforce, which can be reached by clicking the gear in the top right corner of the Salesforce application and choosing the Setup option. Make sure you have the correct permissions to see it. Once you are in the Setup section you should have access to the following:

Schema Builder - This is where you can find the relational model (object model) for your Salesforce instance. It has a nice customizable interface that will give you greater insight into the database relations in Salesforce. It can be found here:

  • Setup>Platform Tools>Objects and Fields>Schema Builder

Object Manager - This is where you can view the fields and attributes configured for Salesforce tables (objects). It can be found here:

  • Setup>Platform Tools>Objects and Fields>Object Manager

Bulk Data Load Jobs - If you are running bulk API jobs you will want to know more about their execution. This Setup section shows active jobs and a history of all jobs run, including any errors in those jobs. These can be found here:

  • Setup>Platform Tools>Environments>Jobs>Bulk Data Load Jobs

API Usage Report  - This is something you will need to keep an eye on. Here are instructions for viewing or creating this report: https://help.salesforce.com/articleView?id=000326126&type=1&mode=1

Conclusion

This should provide a good primer to get you up to speed with data integration in Salesforce. Once you know all the rules and how to make it work, it really is quite easy.

Remove Empty Row from CSV File Using PowerShell in SSIS

When you export data to a flat file destination in SSIS you end up with an empty row at the end of the file. This is by design: each row ends with a CR LF, so the last row of data still has a carriage return and line feed at the end of it, hence the blank row that follows.

Finding a solution to this problem is not straightforward, likely because it is not really a problem--though not everyone will agree with that opinion. If you have been Googling, you probably stumbled upon the conditional split solution, but that only works if you are starting from a CSV that already has the trailing blank row and you are writing that data somewhere other than a flat file destination. We want to remove the empty row after we write to the flat file destination.

If you are not a C# developer or a VB developer (nobody should be one of these, actually), then you might struggle with the script task in SSIS. Instead, we are going to invoke a process: PowerShell.exe. Here are the steps you will need to follow:

  • Make sure PowerShell is installed/enabled on the server that the package runs on.

  • On the server where your SSIS package is executed, create the PowerShell script that removes the blank row (below). Replace the value in the $textFile variable with the path to your target file. Save the script in a directory where the SSIS service account can execute and modify the file. Save the file as PS_RemoveBlankRow.ps1 or whatever you like.

# Path to the flat file your SSIS package just wrote - point this at your target file.
$textFile = 'C:\YourFileShare\YourTargetFile.csv'
# Read the whole file, strip any trailing carriage returns/line feeds, and write it back in place.
[System.IO.File]::WriteAllText($textFile, [System.IO.File]::ReadAllText($textFile) -replace '[\r\n]+$')
  • Before you can remove the empty row from your file you first need to save the file to its final destination. So work through all the steps in your SSIS package to get your data to the desired flat file.

  • Now we are ready to clean the file. In the Control Flow tab of your package add an Execute Process Task. Click on the Process tab and, in the Executable option, type PowerShell.exe; this will invoke PowerShell. In the Arguments box type the following (below). The -F option is short for -File, followed by the directory and file name of the PS script we created in the second step. You can leave the rest of the settings at their defaults.

-F C:\MyFileShare\PS_RemoveBlankRow.ps1
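
If the server's execution policy blocks unsigned scripts, a common variation (assuming your security team is comfortable with it) is to skip the profile and set the policy for just this call:

-NoProfile -ExecutionPolicy Bypass -File C:\MyFileShare\PS_RemoveBlankRow.ps1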

  • You might have to grant an exception for executing this script file if you have some kind of AV that blocks it. It depends on your environment.

There you go, a simple way to remove the trailing empty row in a CSV.

Troubleshooting Custom dll’s (ionic) in Script Tasks in SSIS

Script tasks are a great way to enrich your ETL, especially when you have complex requirements that can't easily be accomplished in a simple task or expression. The other day, while working on a project, I ran across an issue involving a script task and a custom dll, specifically Ionic.Zip.dll. The script task was failing, so I started looking into the dll. As I was troubleshooting the issue I put together this guide so we could replicate our steps in production. Here is a simple checklist you can follow when working with script tasks and custom dll's in SSIS.

  • Copy the dll's to your target assemblies directory (depending on your environment, your directory may be different): C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.0
  • Once you copy the dll's, right click and go to properties of each dll and make sure the file is not marked as "blocked". Windows will mark a copied file it thinks is dangerous to run on the machine, like a dll or executable, as blocked. Click Unblock and apply/ok.
  • Open the Script task in SSIS. Click on the references folder in Solution Explorer. Highlight the dll you are calling. Under reference properties, set the Copy Local option to True.
  • When working with a custom dll you may need to register it in the GAC (Global Assembly Cache), unless you have a proper installer for it. Open a CMD prompt and navigate to the folder where gacutil.exe lives. Mine could be found here: C:\Program Files (x86)\Microsoft SDKs\Windows\v7.0A\bin\NETFX 4.0 Tools, though yours might be different. Copy your dll to the folder and register it (a quick example follows this list). Gacutil usage can be found here: https://msdn.microsoft.com/en-us/library/ex0ss12c(v=vs.110).aspx
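
As a quick example (the dll name here is the Ionic assembly from this scenario; substitute whatever you copied in), installing and then verifying the registration looks like this:

gacutil /i Ionic.Zip.dll
gacutil /l Ionic.Zip

The /i switch installs the assembly into the GAC and /l lists what is registered under that name.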

Modifying the SSIS Environment Reference in SQL Agent Jobs That Run an SSIS Package in the Integration Services Catalog

If you are using the SSIS Integration Services Catalog and you run SSIS packages in SQL Agent jobs, you are probably taking advantage of the environments feature in SSIS (2012+). Environments map to individual servers--servers that could be members of a mirror, dev/QA/prod servers, or servers that are part of a migration. Assuming you run all of your SSIS on a dedicated server, this gives you a lot of flexibility as to where a package is executed. When you set up an Agent job, you tell it what environment you want the job to run in. From time to time you'll need to change this environment during a failover, a migration, or some other event. This is where it gets tricky--there is no easy way to update a large number of SQL Agent jobs, not without modifying string fields in msdb tables. Honestly, I'm surprised Microsoft hasn't done anything to improve this yet.

I am posting this right from the script I wrote, hopefully making it easier to save off to your own script collection. There are two parts to this script: the first part shows you how to change the environment reference by modifying the msdb..sysjobsteps table directly, and the second part shows you how to modify the jobs using sp_update_jobstep. If you are a go-by-the-book type of person you might want to refer to the second part of the script; otherwise you can light a hundred dollar bill on fire, snapchat a picture of your ass to your boss, and then refer to the first part of the script.

-- The DGAF method

/* First we need to find the environment we are currently using and
   the environment we plan to use. Each of these correspond to 2 different
   servers. */

SELECT
   Reference_ID
  ,Environment_Name
  ,Environment_Folder_Name
FROM SSISDB.internal.environment_references

/* Make a note of your environments, current and target. We are looking for the reference ID that
   corresponds to the current server so we can pass this reference_id as a filter. We are going to update the
   command line in the agent job for all jobs that refer to this environment reference. */

/* Now that you know the environment you want to target, next you can run the query below and filter
   on that environment. We are just selecting first so we can see what we'll be changing, then we will
   update. */

SELECT DISTINCT
   SUBSTRING(js.command, (PATINDEX('%/ENVREFERENCE %', js.command) + 14),2) AS AgentEnvironmentRef
  ,job_id
  ,step_id
  ,command
FROM msdb..sysjobsteps js
WHERE SUBSTRING(js.command, (PATINDEX('%/ENVREFERENCE %', js.command) + 14),2) = '8'; -- '8' is the environment reference_id, this is what we want to change.

/* If we want to update all job steps that reference the environment
   associated with reference_id 8, we could run the following. This will change
   all job steps that reference environment 8 to environment 5 */

--UPDATE msdb..sysjobsteps
--SET command = REPLACE(command, '/ENVREFERENCE 8', '/ENVREFERENCE 5' )
--WHERE SUBSTRING(command, (PATINDEX('%/ENVREFERENCE %', command) + 14),2) = '8'

/* All of this assumes you are comfortable updating the sysjobsteps table directly,
   which many of us are not, and it is a practice that Microsoft doesn't recommend.
   So here is another way we could implement this change using the msdb.dbo.sp_update_jobstep sp. */

--=================================================================

-- The "I only go by the book" method

-- Create a temp table to store job details

CREATE TABLE #Refchange (
   AgentEnvironmentRef nvarchar(5) NOT NULL
  ,job_id uniqueidentifier NOT NULL
  ,step_id int NOT NULL
  ,command nvarchar(2000) NOT NULL );
  
-- Load your temp table with the jobs/steps you want to change        

WITH CTEazyE AS (
   SELECT DISTINCT
      SUBSTRING(js.command, (PATINDEX('%/ENVREFERENCE %', js.command) + 14),2) AS AgentEnvironmentRef
     ,job_id
     ,step_id
     ,command
 FROM msdb..sysjobsteps js
 WHERE SUBSTRING(js.command, (PATINDEX('%/ENVREFERENCE %', js.command) + 14),2) = '8' )
INSERT INTO #Refchange
SELECT
   AgentEnvironmentRef
  ,job_id
  ,step_id
  ,command
FROM CTEazyE c
INNER JOIN SSISDB.internal.environment_references e ON c.AgentEnvironmentRef = e.reference_id
WHERE AgentEnvironmentRef = '8';

-- Take a look at what we have now

SELECT * FROM #Refchange;

-- Update the command field in our temp table, modify the environment reference

UPDATE #Refchange
   SET command = REPLACE(command, '/ENVREFERENCE 8', '/ENVREFERENCE 5' )
WHERE SUBSTRING(command, (PATINDEX('%/ENVREFERENCE %', command) + 14),2) = '8'

/* Now we should have our job, job step, and environment details in our temp table.

Next we need to update each job to use the new environment using sp_update_jobstep. */

DECLARE
   @jobid uniqueidentifier
  ,@stepid int
  ,@newcommand nvarchar(2000)
  
DECLARE UpdateJobCur CURSOR STATIC FORWARD_ONLY
FOR
SELECT
   job_id
  ,step_id
  ,command
FROM #Refchange

OPEN UpdateJobCur
WHILE (1=1)
BEGIN
FETCH NEXT FROM UpdateJobCur INTO @jobid, @stepid, @newcommand

IF @@FETCH_STATUS <> 0 BREAK

SELECT 
   @jobid
  ,@stepid
  ,@newcommand
EXEC msdb.dbo.sp_update_jobstep
     @job_id = @jobid
    ,@step_id = @stepid
    ,@command = @newcommand
END

CLOSE UpdateJobCur;
DEALLOCATE UpdateJobCur;
GO

Make sure you use caution when making changes that affect the SSIS environments in Agent jobs; you could really mess something up if you don't understand what you are changing. Microsoft needs to implement a better way to do this. The connect item can be seen here.

How to Convert a CYYMMDD Date in SSIS

If you have ever had to extract a date field from DB2 where the date follows a CYYMMDD format (a Y2K-era convention), you probably understand some of the challenges of getting that into a usable date field in SQL. In our environment the DB2 CYYMMDD value was stored as a decimal, and we needed to transform it to conform with a Date data type in SQL. Here is how we do it.

First you need to specify your data source in SSIS; this will be the DB2 data source. Once you have your data in the pipeline, add a Derived Column transform. Under the "Derived Column" drop-down select "add as new column". In the Derived Column Name field provide a name for your column, in this example DC_ReleasedDate. In the Expression field type the following expression:

(DT_DATE)(ReleasedDate == 0 ? "1900/01/01" : (SUBSTRING(((DT_WSTR,10)(ReleasedDate + 19000000)),1,4) + "/" + SUBSTRING(((DT_WSTR,10)(ReleasedDate + 19000000)),5,2) + "/" + SUBSTRING((DT_WSTR,10)(ReleasedDate + 19000000),7,2)))

You can replace "ReleasedDate" with your own DB2 date data field. In the above expression, we are creating a default date of "1900/01/01" if no date exists but this part can be removed if you need.

Where you see (DT_DATE) at the beginning of the expression, that casts the expression result as a date. Following that you see ReleasedDate, the name of our date field in DB2. We are looking for any dates that equal 0; if a date equals 0 it is replaced with "1900/01/01" (a string), and if it does not, we add 19000000 to the decimal date and then convert it to a string (DT_WSTR). We add the 19000000 so we get accurate dates for both 0 and 1 century values--the C in CYYMMDD--since century 0 maps to 19xx and century 1 maps to 20xx. Next we use SUBSTRING to get the yyyy, and we do the same for month and day while concatenating each piece with a forward slash. The end result is a yyyy/mm/dd string, which the (DT_DATE) cast at the start of the expression turns into a date. You can also accomplish this in a script task, but I found this to be easier.
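
If you want to sanity-check the arithmetic outside of SSIS, here is a quick PowerShell sketch of the same logic (the sample values are made up):

# Century digit 0 = 19xx, century digit 1 = 20xx, so adding 19000000 yields yyyyMMdd.
$released = 1210315                                   # C=1, YY=21, MM=03, DD=15
$ymd = ($released + 19000000).ToString()              # 1210315 + 19000000 = 20210315
[datetime]::ParseExact($ymd, 'yyyyMMdd', $null)       # 2021-03-15

$released = 990704                                    # C=0, YY=99 (stored as 0990704)
[datetime]::ParseExact(($released + 19000000).ToString(), 'yyyyMMdd', $null)   # 1999-07-04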

The destination table in SQL has a [Date] data type since we don't have a timestamp in DB2, only a date. You could still work the same solution with a timestamp; it would just require a little extra work in the expression.