SQL Server 2012 Data Integration Recipes

  

For your convenience Apress has placed some of the front matter material after the index. Please use the Bookmarks and Contents at a Glance links to access them.

Contents at a Glance

  About the Author
  Acknowledgments
  Chapter 1: Sourcing Data from MS Office Applications
  Chapter 2: Flat File Data Sources
  Chapter 3: XML Data Sources
  Chapter 4: SQL Databases
  Chapter 5: SQL Server Sources
  Chapter 6: Miscellaneous Data Sources
  Chapter 7: Exporting Data from SQL Server
  Chapter 8: Metadata
  Chapter 9: Data Transformation
  Chapter 10: Data Profiling
  Chapter 11: Delta Data Management
  Chapter 12: Change Tracking and Change Data Capture
  Chapter 13: Organising And Optimizing Data Loads
  Chapter 14: ETL Process Acceleration
  Chapter 15: Logging and Auditing
  Appendix A: Data Types
  Index

Introduction

  Microsoft SQL Server 2012 is a vast subject. One part of the ecosystem of this powerful and comprehensive database which has evolved considerably over many years is data integration – or ETL if you want to use another virtually synonymous term. Long gone are the days when BCP was the only available tool to load or export data. Even DTS is now a distant memory. Today the user is spoilt for choice when it comes to the plethora of tools and options available to get data into and out of the Microsoft RDBMS. This book is an attempt to shed some light on many of the ways in which data can be both loaded into SQL Server and sent from it into the outside world. I also try to give some ideas as to which techniques are the most appropriate to use when faced with various different challenges and situations.

  This book is not, however, just an SSIS manual. I have a profound respect for this excellent product, but do not believe that it is the “one stop shop” which some developers take it to be. I wanted to show readers that there are frequently alternative technologies which can be applied fruitfully in many ETL scenarios. Indeed my philosophy is that when dealing with data you should always apply the right solution, and never believe that there is only one answer. Consequently this book includes recipes on many of the other tools in the SQL Server universe. Sometimes I have deliberately shown varied ways of dealing with essentially the same challenge. I hope by doing this to arouse your curiosity and also to provide some practical examples of ways to get data from myriad sources into SQL Server databases cleanly and efficiently.

  Although this book specifically targets users of SQL Server 2012, I try, wherever feasible, to say if a recipe can be applied to previous versions of the database. I also try to highlight any new features and differences between SQL Server 2012 and older versions. This is because it is unlikely that users will only ever deal with the latest version of this RDBMS; most sites are likely to have multiple versions in production. I only ever go back as far as SQL Server 2005 when pointing out how the database has evolved, as this was the version which introduced SSIS, the major turning point in SQL Server-based ETL.

  As the book is focused on SQL Server nearly all the code used is T-SQL. Some of the samples given are extremely simple, others are more complex. All of it is concentrated on ETL requirements. Consequently you will find no OLTP or DBA-based examples in this book. You will find a few touches of MDX where handling Analysis Services data is concerned and some VB.Net where SSIS script tasks are used. I have chosen to use VB.Net in nearly all the SSIS script tasks described in this book as it is, in my experience, the .Net language that many T-SQL programmers are most familiar with. Nonetheless I have added one or two snippets of C# (particularly where CLR assemblies are used) to avoid accusations of neglecting this particular language.

  Data integration is a vast subject. Consequently, in an attempt to apply a little structure to a potentially enormous and disparate domain, this book is divided into two main parts. The first part—Chapters 1 through 7—deals with the mechanics of getting data into and out of SQL Server. Here you will find the essential details of how to connect to various data sources, and then ingurgitate the data. As many potential pitfalls and traps as possible are brought to your attention for each data source.

  The second part, Chapters 8 through 15, deals with the wider ETL environment. Here we progress from the nuts and bolts to the coordinated whole of extracting, transforming, and (efficiently) loading data. These chapters take the reader on a trip through the process of metadata analysis, data transformation, profiling source data, logging data processes, and some of the ways of optimizing data loads.

  For this book I decided to avoid the ubiquitous AdventureWorks, and use my own sample database. There are a few reasons for this. Firstly, I thought that AdventureWorks was so large and complex that it could divert attention from some of the techniques which I wanted to explain. I prefer to use an extremely simplistic data structure so that the reader is free to focus on the essence of what is being explained, and not the data itself. Secondly I wished to avoid the added complexity of the multiple interrelated tables and foreign keys present in AdventureWorks. Finally I did not want to be using data which took time to load. This way, once again, you can concentrate on process and principle, and not develop “ETL-stare” while you watch a clock ticking as thousands of records churn into a table, accompanied by whirling on-screen images or the blinking of a bleary-eyed hard disk indicator. Consequently I have preferred to use an extremely uncluttered set of source data. A full description of the source database(s) is given in Appendix B.

  Please also note that this book is not destined to be a progressive self-tuition manual. You are strongly advised to drop and recreate the sample databases between recipes to ensure a clean environment to test the examples that are given. Indeed the whole philosophy of the recipe-based approach is that you can dip in anywhere to find help, except in the rare cases where there are specific indications that a recipe requires prior reading or builds on a previous explanation.

  The recipes in this book cover a wide variety of needs, from the extremely simple to the relatively complex. This is in an attempt to cover as wide a range of subjects as possible. The consequence is that some recipes may seem far too simplistic for certain readers, while others may wonder if the more advanced solutions are relevant to their work. I can only hope that SQL Server beginners will find easy answers and that advanced users will nonetheless find tweaks and suggestions which add to their knowledge. In all cases I sincerely hope that you will find this book useful.

  Inevitably, not every question can be answered and not every issue resolved in one book. I truly hope that I have covered many of the essential ETL tasks that you will face, and have provided ways of solving a reasonable number of the problems that you may encounter. My apologies, then, to any reader who does not find the answer to their specific issue, but writing an encyclopaedia was not an option. In any case, I can only encourage you to read recipes other than those that cover the precise subject that interests you, as you may find potential solutions elsewhere in this book.

  I wish you good luck in using SQL Server to extract, transform, and load data. And I sincerely hope that you have as much fun with it as I had writing this book.

  —Adam Aspin

Chapter 1

Sourcing Data from MS Office Applications

  I suspect that many industrial-strength SQL Server applications have begun life as a much smaller MS Office-based idea, which has then grown and been extended until it has finished as a robust SQL Server application. In any case, two Microsoft Office programs, Excel and Access, are among the most frequently used sources of data for eventual loading into SQL Server. There are many reasons for this, from their sheer ubiquity to the ease with which users can enter data into Access databases and Excel spreadsheets. So it is no wonder that we developers and DBAs spend so much of our time loading data from these sources into SQL Server.

  There are a number of ways in which data can be pushed or pulled from MS Office sources into SQL Server. These include:

  • Using T-SQL (OPENDATASOURCE and OPENROWSET)
  • Linked Servers (yes, an Access database or even an Excel spreadsheet can be a linked server)
  • SSIS
  • The SQL Server Import Wizard
  • The SQL Server Migration Assistant for Access

  This chapter examines all these techniques and tries to give you some guidelines on their optimal uses (and inevitable limitations).

  Any sample files used in this chapter are found in the C:\SQL2012DIRecipes\CH01 directory—assuming that you have downloaded the samples from the book’s companion web site and installed them as described in Appendix B.

1-1. Ensuring Connectivity to Access and Excel

Problem

  You want to be able to import data from all versions of Excel and Access (including the latest file formats) in both 32-bit and 64-bit environments.

Solution

  You need to install the Microsoft Access Connectivity Engine (ACE) driver. Here are the steps to follow:

  1. Click Download on the requisite web page. This will download the executable file to your selected directory.

  Note ■ The download location of the ACE driver could change over time, but a quick Internet search should point you to the current source fast enough.

  2. Double-click the AccessDatabaseEngine.exe file that you have downloaded. This will be AccessDatabaseEngine_x64.exe for the 64-bit version.

  3. Follow the instructions.

  4. In SSMS, expand Server Objects ➤ Linked Servers ➤ Providers.

  6. Double-click the provider and check Allow InProcess and Dynamic Parameter.

  As an alternative to steps 4-6, if you prefer a command-line approach, run the following T-SQL snippet (C:\SQL2012DIRecipes\CH01\SetACEProperties.Sql in the samples for this book):

  EXECUTE master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'AllowInProcess', 1;
  GO
  EXECUTE master.dbo.sp_MSset_oledb_prop N'Microsoft.ACE.OLEDB.12.0', N'DynamicParameters', 1;
  GO

  You now have the driver installed and ready to use.
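  If you want to confirm that the provider is registered before going any further, one option is to list the OLE DB providers known to the instance. This is only a sketch: sp_enum_oledb_providers is an undocumented system procedure, so treat its availability and output format as an assumption to verify on your version of SQL Server.

```sql
-- List the OLE DB providers registered on this instance.
-- After a successful install, Microsoft.ACE.OLEDB.12.0 should appear in the output.
EXECUTE master.dbo.sp_enum_oledb_providers;
```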

How It Works

  Before attempting to read data from Excel or Access, it is vital to ensure that the drivers that allow the files to be read are installed on your server. Only the “old” 32-bit Jet driver is currently installed with an SQL Server installation, and that driver has severe limitations. These are principally that it cannot read the latest versions of Access and Excel, and that it will not function in a 64-bit environment.

  Using the latest ACE driver generally makes your life much easier, as the newest versions have all the capabilities of the older versions as well as adding extra functionality. Despite being called the “AccessDatabaseEngine,” this driver also reads and writes data to Excel files, as well as to text files.

  Confusingly, the 2007 Office System Driver and the Microsoft Access Engine 2010 redistributable are both found as “Microsoft.ACE.OLEDB.12.0” in the list of linked server providers in SSMS. 64-bit SQL Server applications can access 32-bit Jet and 2007 Office System files by using 32-bit SQL Server Integration Services (SSIS) on 64-bit Windows.

  The versions of the Office drivers currently available are listed in Table 1-1.

  Table 1-1. MS Office Drivers

  Driver Title | Driver Name | Source | Comments
  OLEDB Provider for Microsoft Jet | Microsoft.Jet.OLEDB.4.0 | SQL Server Installation (installed with the client tools) | 32-bit only. Reads and writes Excel & Access 97-2003. Accepts .xls and .mdb formats.
  2007 Office System Driver | Microsoft.ACE.OLEDB.12.0 | | 32-bit only. Reads and writes Excel & Access 97-2007. Accepts .xls/.xlsx/.xlsm/.xlsb and .mdb/.accdb formats.
  Microsoft Access Engine 2010 redistributable | Microsoft.ACE.OLEDB.12.0 | | 32-bit or 64-bit versions available. Reads and writes Excel & Access 97-2010. Accepts .xls/.xlsx/.xlsm/.xlsb and .mdb/.accdb formats.

Hints, Tips, and Traps

  • If you still want to use the old 32-bit Jet driver, then you can do so provided that you save the Excel source in Excel 97–2003 format and are working in a 32-bit environment.
  • The ACE drivers are supported by Windows 7; Windows Server 2003 R2, 32-bit x86; Windows Server 2003 R2, x64 editions; Windows Server 2008 R2; Windows Server 2008 with Service Pack 2; Windows Vista with Service Pack 1; and Windows XP with Service Pack 3.
  • You can only install either the 64-bit version of the ACE driver or the 32-bit version on the same server. This means that you cannot develop in Business Intelligence Development Studio (BIDS) or SQL Server Data Tools (SSDT) with the 64-bit ACE driver installed, as BIDS/SSDT is a 32-bit environment. However, if you install the 32-bit ACE driver instead, then you cannot run a 64-bit package, and have to use one of the 32-bit workarounds. Ideally, you should develop in a 32-bit environment with the 32-bit ACE driver installed (or on a 64-bit machine, but do not expect to run the package normally), and deploy to a 64-bit environment where the 64-bit driver is ready and waiting.

1-2. Importing Data from Excel

Problem

  You want to import data from an Excel spreadsheet as fast and as simply as possible.

Solution

  Run the SQL Server Import and Export Wizard and use it to guide you through the import process.

  Here is the process to follow:

  1. In SQL Server Management Studio, right-click a database (preferably the one into which you want the data imported), and click Tasks ➤ Import Data (see Figure 1-1).

  Figure 1-1. Launching the Import/Export Wizard from SSMS

  2. Skip the splash screen. The Choose a Data Source screen appears.

  3. Select Microsoft Excel as the data source, and enter or browse for the file to import. Be sure to select the Excel version that corresponds to the type of source file from the pop-up list, and specify if your data includes headers (see Figure 1-2).

  Figure 1-2. Choosing a Data Source in the Import/Export Wizard

  4. Click Next. The Choose a Destination dialog box appears (see Figure 1-3).

  Figure 1-3. Choosing a Destination in the Import/Export Wizard

  5. Ensure that the destination is SQL Server Native Client, that the server name is correct, and that you have selected the right destination database (CarSales_Staging in this example) and the authentication mode which you are using (with the appropriate username and password for SQL Server authentication).

  6. Click Next. The Specify Table Copy or Query dialog box appears (see Figure 1-4).

  Figure 1-4. Specifying Table Copy or Query in the Import/Export Wizard

  7. Accept the default “Copy data from one or more tables or views”.

  8. Click Next. The Select Source Tables or Views dialog box appears (see Figure 1-5).

  Figure 1-5. Choosing the Source Table(s) in the Import/Export Wizard

  9. Select the worksheet(s) to import.

  10. Click Next. The Save and Run Package dialog box appears (see Figure 1-6).

  Figure 1-6. Running the Import/Export Wizard package

  11. Ensure that Run Immediately is checked and that Save SSIS Package is not checked.

  12. Click Next. The Complete the Wizard dialog box appears (see Figure 1-7).

  Figure 1-7. Completing the Import/Export Wizard

  13. Click Finish. The Execution Results dialog box appears. Assuming that all went well, the data has loaded successfully (see Figure 1-8).

  Figure 1-8. Successful execution using the Import/Export Wizard

  14. Click Close to end the process.

  How It Works

  There will probably be times when your sole aim is to get a load of data from an Excel spreadsheet into an SQL Server table as fast as possible. Now, when I say “fast,” I do not only mean that the time to load is very short, but that the time spent setting up the load process is minimal and that the job gets done without going to the bother of setting up an SSIS package, defining a linked server, or writing T-SQL using OPENROWSET to do the job. This is where the SQL Server Import and Export Wizard (DtsWizard for short) comes into its own. An extra inducement is that the guidance provided by the DtsWizard application can be invaluable if you only import spreadsheet data infrequently. As this is the first time that the Import and Export Wizard is explained in this book, I have tried to make the explanation as complete as possible. The advantage is that you will find many of the techniques explained here useable for other types of source data, too.

  You should use the SQL Server Import and Export Wizard:

  • When you need to import data from an Excel spreadsheet into an SQL Server table just once.
  • When you do not intend to perform the action regularly or frequently.
  • When you rarely import Excel data, you don’t want to get lost in the arcane world of SSIS and/or rarely used SQL commands. You want the data imported fast.
  • When you want to import data from multiple worksheets or ranges in the same workbook.

  Assuming that your Excel data is clean and structured like a data table, then the data will load. It can either be transferred to a new table (or new tables), which are created in the destination database with the same name(s) as the source worksheets, or into existing SQL Server tables. You can decide which of these alternatives you prefer in step 8.

Hints, Tips, and Traps

  • If you are working in a 64-bit environment, the 32-bit version of the Import/Export Wizard runs from SSMS. To force the 64-bit version to run, choose Start ➤ All Programs ➤ Microsoft SQL Server 2012 ➤ Import and Export Data (64 bit). Should you need to install the 32-bit version of the wizard, select either Client Tools or SQL Server Data Tools (SSDT) during setup.
  • If you plan on using the DtsWizard.exe frequently, add the path to the executable to your system path variable, unless it has already been added.
  • You can also launch the SQL Server Import and Export Wizard executable by entering Start ➤ Run ➤ DtsWizard.exe (normally found in C:\Program Files\Microsoft SQL Server\110\DTS\Binn), or by double-clicking on the executable in a Windows Explorer window (or even a command window).

1-3. Modifying Excel Data During a Load

Problem

  You want to import data from an Excel spreadsheet, but need to perform a few basic modifications during the import. These could include altering column mapping, changing data types, or choosing the destination table(s), among other things.

  Solution

  Apply some of the available options of the SQL Server Import and Export Wizard. As we are looking at options for the SQL Server Import and Export Wizard, I will describe them as a series of “mini-recipes,” which extend the previous recipe.

  Note ■ Step numbers in the sections to follow refer to the process in Recipe 1-2.

  Querying the Source Data

  To filter the source data, at step 6, choose the “Write a query to specify the data to transfer” option. You see the dialog box in Figure 1-9.

  Figure 1-9. Specifying a source query to select Excel data

  Here you can enter an SQL query to select the source data. If you have saved an SQL query, you can browse to load it. Note that you use the same kind of syntax as when using OPENROWSET, as described in Recipe 1-4. When writing queries, note that worksheet data sources have a “$” postfix, but ranges do not.
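  As a hedged illustration of that naming rule, the two queries below show the kind of source query you could type into the wizard. They assume the sample workbook's Stock worksheet and TinyRange named range (both used in Recipe 1-4) and its ID and Marque columns; use one query or the other, not both.

```sql
-- Worksheet source: the sheet name takes a "$" postfix and is wrapped in square brackets
SELECT ID, Marque FROM [Stock$];

-- Named range source: no "$" postfix
SELECT ID, Marque FROM TinyRange;
```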

  Altering the Destination Table Name

  In step 8, you can change the destination table name to override the default worksheet or range name.

  Replacing the Data in the Destination Table

  Another available option is to replace all the data in the destination table. Of course, this will only affect an existing table; if the table does not exist, then DtsWizard creates one whichever option is selected.

  To do this, at step 8 from earlier, click Edit Mappings. The Column Mappings dialog box appears (see Figure 1-10).

  Figure 1-10. Editing column mappings in the Import/Export Wizard

  Selecting Delete Rows in Destination Table truncates the destination table before inserting the new data. This option is only available if the destination table exists already.

  Enabling Identity Insert

  The Column Mappings dialog box (see Figure 1-10) also lets you enable identity insert, and insert values into an SQL Server Identity column. Simply check the “Enable identity insert” check box.

  Adjusting Column Mappings

  The Column Mappings dialog box also lets you specify which source column maps to which specific destination column. Simply select the required destination column from the pop-up list—or <Ignore> if you do not wish to import the data for a specific column.

  Changing Field Types for New Tables

  You can, within the permissible limits of data type mappings, change both field types and lengths/sizes.

  Altering the size of a text field avoids the default 255-character import text field length. Changing the field type modifies the field type during the data load.

  If you are creating a new table, then the new table is created with the newly defined field types and sizes. However, be warned, altering data types will not alter the data, and any types or data lengths that you choose must be compatible with the source data, or the load will fail.

  Creating an SQL Server Integration Services (SSIS) Package from the Import/Export Wizard

  An extremely useful feature of the Import/Export Wizard is the ability to create a fully-fledged SSIS package from the parameters that you have set when configuring your import. This is probably no surprise, as the Import/Export Wizard is, essentially, an SSIS package generator. While the packages that it generates are not perfect, they are a good (and fast) start to an ETL creation process.

  To generate the SSIS package, simply check the Save SSIS Package box in the Save and Run Package dialog box (see step 10, Figure 1-6). You are prompted for a file location. The package is created when you click Finish.

  How It Works

  Having stressed (I hope) that DtsWizard is a fabulous tool for rapid, simple data imports, I wanted to extend your understanding by showing how versatile a tool the DtsWizard can prove to be in more complex import scenarios. This is due to the wide range of options and parameters that are available to help you to fine-tune Excel imports.

  Hints, Tips, and Traps

  • If you are using SQL Server 2005, then you will find a couple of minor differences in the Choose a Data Source dialog box shown in Figure 1-2.
  • ... invaluable for getting error messages should there be any problems.

  

1-4. Specifying the Excel Data to Load During an Ad-Hoc Import

Problem

  You want to import only a specific subset of data from an Excel spreadsheet by defining the rows to load or filtering the source data.

  Solution

  Use SQL Server’s OPENROWSET command as part of a SELECT statement. This lets you use standard T-SQL to subset the source data. For example, you can run the following code snippets:

  1. In the CarSales_Staging database, create a destination table named LuxuryCars defined as follows (C:\SQL2012DIRecipes\CH01\tblLuxuryCars.Sql):

  CREATE TABLE dbo.LuxuryCars
  (
      InventoryNumber int NULL,
      VehicleType nvarchar(50) NULL
  );
  GO

  2. Enable remote queries, either by running the Facets/Surface Area Configuration tool (or the Surface Area Configuration tool directly in SQL Server 2005), or running the T-SQL given in the following (C:\SQL2012DIRecipes\CH01\AllowDistributedQueries.Sql):

  EXECUTE master.dbo.sp_configure 'show advanced options', 1;
  GO
  RECONFIGURE;
  GO
  EXECUTE master.dbo.sp_configure 'ad hoc distributed queries', 1;
  GO
  RECONFIGURE;
  GO

  3. Run the following SQL snippet (C:\SQL2012DIRecipes\CH01\OpendatasourceInsertACE.Sql):

  INSERT INTO CarSales_Staging.dbo.LuxuryCars (InventoryNumber, VehicleType)
  SELECT CAST(ID AS INT) AS InventoryNumber, LEFT(Marque, 50) AS VehicleType
  FROM OPENDATASOURCE(
      'Microsoft.ACE.OLEDB.12.0',
      'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xls;Extended Properties = Excel 12.0')...Stock$
  WHERE MAKE LIKE '%royce%'
  ORDER BY Marque;
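  If step 3 fails with a message saying that ad hoc access has been blocked, a quick look at the server configuration can help. The query below is a sketch that reads the sys.configurations catalog view to show whether the option set in step 2 is actually in effect.

```sql
-- value is the configured setting; value_in_use is what the running instance applies.
-- Both should be 1 once RECONFIGURE has run.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'Ad Hoc Distributed Queries';
```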

  How It Works

  There are times when quick access to the data in an Excel worksheet is all you need. This could be because you need to perform a quick SELECT...INTO or INSERT INTO...SELECT using Excel as the data source. In this case, firing up SSIS, or even running the Import Wizard (see Recipe 1-2), to load data can seem like overkill. This is where judicious application of SQL Server’s OPENDATASOURCE and OPENROWSET commands as part of a SELECT statement can be extremely useful. Indeed, as you will see shortly, once you know how to connect to the source file, even quite complex T-SQL SELECT statements can be used on Excel source data. And, as you are writing standard SQL commands, they can be run from a query window or as part of a stored procedure. This is particularly useful when:

  • You want to read the contents of an Excel worksheet, but don’t want to clutter up your database with extra tables of information.
  • The data will be read infrequently.
  • You know the file (workbook) and worksheet names, and have a good idea of the data structures; in other words, you can open the file to read it.
  • You want to perform ad hoc querying, and choose the columns and filter the data using standard SQL commands.
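  To illustrate the stored procedure option mentioned above, here is a minimal sketch. The procedure name pr_LoadLuxuryCars is hypothetical, and the code assumes the LuxuryCars table from step 1 and the sample workbook with its Stock worksheet. Note that OPENROWSET accepts only literal strings as arguments, so a file path held in a variable would require dynamic SQL instead.

```sql
-- Hypothetical wrapper procedure; the name is an assumption for illustration only
CREATE PROCEDURE dbo.pr_LoadLuxuryCars
AS
BEGIN
    SET NOCOUNT ON;

    -- Load the Excel rows into the staging table created in step 1
    INSERT INTO dbo.LuxuryCars (InventoryNumber, VehicleType)
    SELECT CAST(ID AS int), LEFT(Marque, 50)
    FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
                    'Excel 12.0;Database=C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
                    'SELECT ID, Marque FROM [Stock$]');
END;
GO

-- Run the load
EXECUTE dbo.pr_LoadLuxuryCars;
```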

  Without attempting to be exhaustive, here are some variations on this theme. I use either the Jet driver or the ACE driver indiscriminately. I use Excel worksheets in both 97–2003 and 2007–2010 formats because the techniques described work with all these formats. I am not adding INSERT INTO or SELECT...INTO code here, but presume that you will be selecting one or the other in a real-world scenario.

  Note ■ As this is, after all, an ad-hoc scenario, you could well have to run SSMS in “Administrator” mode, by right-clicking SQL Server Management Studio in the Start menu and selecting “Run as Administrator”. This is because the user running SSMS must have read and write permissions on the TEMP directory used by the SQL Server startup account.

  Assuming that you have a named range (TinyRange in the sample file), then you can return the data in the range using T-SQL like this:

  SELECT ID, Marque
  FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
      'Excel 8.0;Database = C:\SQL2012DIRecipes\CH01\CarSales.xls',
      TinyRange);

  If the range does not contain column headers, then you will need to add the HDR = NO property to the T-SQL, as follows. Otherwise, the first row is presumed to be column headers.

  SELECT ID, Marque
  FROM OPENROWSET('Microsoft.Jet.OLEDB.4.0',
      'Excel 8.0;HDR = NO;Database = C:\SQL2012DIRecipes\CH01\CarSales.xls',
      TinyRange);

  If you know the Excel range references corresponding to the data that you want to return, then you can use an SQL snippet like this:

  SELECT ID, Marque
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
      'SELECT * FROM [Stock$A2:B3]');

  You must remember to provide the worksheet as well as the range, as no default worksheet is presumed. Similarly, remember to add HDR = NO if the range does not contain column headers.

  As the previous snippet showed, you can pass an entire SELECT statement via the OLEDB driver to Excel. This presents a whole range of possibilities, such as choosing individual columns. For example:

  SELECT ID, Marque
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
      'SELECT ID, Marque FROM [Stock$A1:C3]');

  Just as in a standard T-SQL statement, you can alias the columns returned. For example:

  SELECT InventoryNumber, VehicleType
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
      'SELECT ID AS InventoryNumber, Marque AS VehicleType FROM [Stock$A2:C3]');

  The “pass-through” query that you send to Excel can also sort the data that is returned. The following example sorts by Marque:

  SELECT ID, Marque
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
      'SELECT ID, Marque FROM [Stock$A2:C3] ORDER BY Marque');

  Finally, if you want to add a WHERE clause, you can do so:

  SELECT InventoryNumber, VehicleType
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
      'SELECT ID AS InventoryNumber, Marque AS VehicleType FROM [Stock$] WHERE MAKE LIKE ''%royce%'' ORDER BY Marque');

  In the provider options, you need to check Supports ‘Like’ Operator for such a filter to work. Note also that you will need to duplicate the single quotes if you are using the LIKE operator.

  You might have a source file without headers for the data. In this case, all you need to do is add HDR = NO; to the syntax. In these circumstances, it is probably best to use column aliases to give the output data greater readability, or the OLEDB provider will merely rename all the columns F1, F2, and so forth. For example:

  SELECT InventoryNumber, VehicleType
  FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0;HDR = NO;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
      'SELECT F1 AS InventoryNumber, F2 AS VehicleType FROM [Stock$A2:C3] WHERE MAKE LIKE ''%royce%'' ORDER BY Marque');

  HDR is not the only property that you might need to know about when importing Excel data. Table 1-2 describes your options. Understanding the IMEX (mixed data types) property is also useful in some cases.

  Table 1-2.

   Jet and ACE Extended Properties Property Name Description

  Examples Specifies if the first row returned contains headers.

  HDR HDR = NO Allows for mixed data types to be imported inside a single column.

  IMEX

  IMEX = 1 Extended properties do require further explanation. Here, HDR merely indicates to the driver whether your source data contains header rows. As the presumption (at least using the Jet and ACE drivers) is that there are header rows, setting this property to

  NO when there are no headers avoids not only having the first record appear as the column names, but also a potential mismatch of data types. It is worth noting that you do not need to specify the Excel file type (.xls/.xlsx/.xslm/.xlsx/.xlsb) as the ACE driver will recognize the file type automatically.

  IMEX is marginally trickier. It does not force the data in a column to be imported as text; it forces the mixed data type defined in the registry for this OLEDB driver to be used. As this registry entry is text by default, it nearly always forces the data in as text. It will not convert the data to text. Depending on the driver (that is, when using the Jet driver in most cases), not setting IMEX = 1 can cause a load failure or return NULLs instead of numeric values in a column containing text and numbers.
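  As a minimal sketch of how the two properties combine, the following query reuses the CarSales.xlsx workbook from the earlier examples and sets both HDR and IMEX in the extended properties string. The HDR = YES setting and the idea that a column mixes text and numbers are assumptions about the worksheet made for illustration, not something the recipe prescribes.

```sql
-- Sketch only: assumes the Stock worksheet has a header row (HDR = YES)
-- and that at least one column mixes text and numeric values.
SELECT ID, Marque
FROM OPENROWSET(
     'Microsoft.ACE.OLEDB.12.0',
     'Excel 12.0;HDR = YES;IMEX = 1;Database = C:\SQL2012DIRecipes\CH01\CarSales.xlsx',
     'SELECT ID, Marque FROM [Stock$]');
```

  Both properties are simply extra semicolon-separated entries in the second argument, so they can be added or removed without touching the pass-through query itself.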

  1-5. Planning for Future Use of a Linked Server

  Problem

  You want to import only a subset of data from an Excel spreadsheet, but you suspect that you will need to carry out this operation repeatedly, and eventually migrate it to a linked server solution. You do not want to have to rewrite everything further down the line.

  Solution

  Use SQL Server’s OPENDATASOURCE command as part of a SELECT statement. For example (C:\SQL2012DIRecipes\CH01\OpendatasourceSelect.Sql):

  SELECT ID AS InventoryNumber, LEFT(Marque,20) AS VehicleType
  INTO RollsRoyce
  FROM OPENDATASOURCE( 'Microsoft.ACE.OLEDB.12.0', 'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xls;Extended Properties = Excel 8.0')...Stock$
  WHERE MAKE LIKE '%royce%'
  ORDER BY Marque;

  How It Works

  The OPENROWSET command is suited to ad hoc querying. However, you may be evaluating data connection possibilities with a view to eventually using a linked server. In this case, you may prefer to use the OPENDATASOURCE command as a kind of “halfway house” to linked servers (described in the next recipe). This sets the scene for you to update your code later by replacing OPENDATASOURCE with a four-part linked server reference.

  Inevitably, there are many variations on this particular theme (which so far uses only the ACE driver against a single worksheet), so here are a few of them. As the objective is to import data into SQL Server, I will let you choose whether to include this code in either a SELECT...INTO or an INSERT INTO...SELECT statement. Of course, you can use the Jet driver if you prefer. If you are using Excel 2007/2010, you must set the extended properties in the T-SQL to Excel 12.0:

  SELECT ID, Marque FROM OPENDATASOURCE( 'Microsoft.ACE.OLEDB.12.0', 'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xlsx;Extended Properties = Excel 12.0')...Stock$;

  To select all the data in a named range, use the following T-SQL:

  SELECT ID, Marque FROM OPENDATASOURCE( 'Microsoft.ACE.OLEDB.12.0', 'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xls;Extended Properties = Excel 8.0')...TinyRange;

  To select columns in the Excel source data, and alias them if you wish, use T-SQL like the following. Note that the aliasing is applied in the T-SQL, and is not part of a pass-through query.

  SELECT ID AS InventoryNumber, Marque AS VehicleType FROM OPENDATASOURCE( 'Microsoft.ACE.OLEDB.12.0', 'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xls;Extended Properties = Excel 8.0')...Stock$;

  Finally, to use WHERE and ORDER BY when returning Excel data, merely extend the T-SQL like this:

  SELECT ID AS InventoryNumber, Marque AS VehicleType
  FROM OPENDATASOURCE( 'Microsoft.ACE.OLEDB.12.0', 'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xls;Extended Properties = Excel 8.0')...Stock$
  WHERE MAKE LIKE '%royce%'
  ORDER BY Marque;

  In this case, the Excel file must not be password-protected. It is worth noting that OPENDATASOURCE only works when the DisallowAdhocAccess registry option is explicitly set to 0 for the specified provider, and the Ad Hoc Distributed Queries advanced configuration option is enabled as described in Recipe 1-3. OPENDATASOURCE also expects the source data to resemble a table complete with header rows, so ensure that any named ranges have a header row.
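  For reference, the instance-level option mentioned above can be switched on with sp_configure. This is a minimal sketch that assumes you hold the ALTER SETTINGS permission; it does not touch the provider-level DisallowAdhocAccess setting, which is stored separately.

```sql
-- Make advanced options visible, then enable ad hoc distributed queries.
EXECUTE sp_configure 'show advanced options', 1;
RECONFIGURE;
EXECUTE sp_configure 'Ad Hoc Distributed Queries', 1;
RECONFIGURE;

-- Check the running value (value_in_use = 1 means the option is enabled).
SELECT name, value_in_use
FROM sys.configurations
WHERE name = 'Ad Hoc Distributed Queries';
```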

  Whether using ACE for Office 2007 or for Office 2010, you must set the Excel version to 12.0, not 14.0 as the download page suggests. Also, if you are using the Jet driver when connecting to Excel (and Access), these approaches will not work in a 64-bit environment in SQL Server (2005–2012), even if the Excel format is 97–2003. If you have to use a driver that causes problems when there are mixed data types in a column, then you can force the driver to scan a larger number of rows (the default is 8), or indeed the entire worksheet, to test for mixed data types. To do this, edit the following registry setting:

  HKEY_LOCAL_MACHINE\Software\Microsoft\Jet\4.0\Engines\Excel\TypeGuessRows

  Setting this value to a figure other than 8 scans that number of rows. Setting it to 0 scans the entire sheet. This, however, inevitably causes a severe performance hit.

  Should you wish to alter the mixed data setting, it is in the following registry hive for Office 2010:

  HKEY_LOCAL_MACHINE\Software\Microsoft\Office\14.0\Access Connectivity Engine\Engines\Excel\ImportMixedTypes

  The usual caveats apply to changing registry settings: back up your registry first, and be very careful!

  Hints, Tips, and Traps

  • An error message along the lines of “Msg 7314, Level 16, State 1, Line 2 The OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "(null)" does not contain the table "Sheet1$"” means that either the table does not exist or the current user does not have permissions on that file or folder. It could also mean that you have not specified the right file and/or path.
  • An error message such as “Msg 7399, Level 16, State 1, Line 4 The OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "(null)" reported an error. The provider did not give any information about the error. Msg 7303, Level 16, State 1, Line 4 Cannot initialize the data source object of OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "(null)"” could very well mean that the Excel workbook file is open, and so cannot be opened by SQL Server. All you have to do is close the Excel workbook. Alternatively, there could be a permissions problem: are you running SSMS as an Administrator?
  • The Excel file must not be password-protected.
  • If all you get back is a NULL value (with a column header of F1), then you probably have not specified the correct worksheet name.
  • You cannot use UNC paths in ad hoc queries.
  • For permissions on folders used by Jet, se

  1-6. Reading Data Automatically from an Excel Worksheet

  Problem

  You need to be able to query or import data directly from an Excel spreadsheet without (re)loading data every time.

  Solution

  Configure the Excel spreadsheet as a linked server. This is how to do it:

  1. Define the linked server using the following code snippet (C:\SQL2012DIRecipes\CH01\AddExcelLinkedServer.Sql):

  EXECUTE master.dbo.sp_addlinkedserver
  @SERVER = 'Excel'
  ,@SRVPRODUCT = 'ACE 12.0'
  ,@PROVIDER = 'Microsoft.ACE.OLEDB.12.0'
  ,@DATASRC = 'C:\SQL2012DIRecipes\CH01\CarSales.xlsx'
  ,@PROVSTR = 'Excel 12.0';

  2. Query the source data, using only the linked server name and worksheet (or range) name in four-part notation, with a T-SQL snippet like (C:\SQL2012DIRecipes\CH01\SelectEXcelLinkedServer.Sql):

  SELECT ID, Marque
  INTO XLLinkedLoad
  FROM Excel...Stock$;
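  To show how little changes when moving from the approach in Recipe 1-5 to a linked server, here is a side-by-side sketch. It assumes the linked server named Excel from step 1 exists and points at the same workbook as the OPENDATASOURCE call.

```sql
-- Recipe 1-5 style: connection details are repeated in every query.
SELECT ID, Marque
FROM OPENDATASOURCE(
     'Microsoft.ACE.OLEDB.12.0',
     'Data Source = C:\SQL2012DIRecipes\CH01\CarSales.xlsx;Extended Properties = Excel 12.0'
     )...Stock$;

-- Linked server style: only the source reference changes.
SELECT ID, Marque
FROM Excel...Stock$;
```

  The column list, and any WHERE or ORDER BY clause you add, stays identical in both forms.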

  How It Works

  There are occasions when data in an Excel worksheet is used as a quasi-permanent source of data, yet you do not want to import the data, but prefer to leave it in Excel. To handle this you might need to set up a linked server to an Excel spreadsheet. The advantage to this approach is that the syntax used to query the data is the standard four-part notation used with any linked server (Server.Database.Schema.Table), and that all the configuration information is defined once for the linked server itself, and not every time that a query is used, as is the case with OPENROWSET and OPENDATASOURCE. This is particularly useful in the following situations:

  • When you need to return data from an Excel spreadsheet on a regular basis.
  • When you feel that spreadsheet data is sufficiently trusted to be a reliable source for your application.

  The risk with this approach is the same that you face with all spreadsheet data: it is all too frequently incoherent, erroneous, or just plain wrong because human error introduced the wrong data into the wrong cells. No automated process can obviate such errors, so a linked server does nothing to alleviate the problem. If the Excel file can be trusted to be sufficiently accurate, then this technique can be a great way to allow SQL Server to read Excel data without having to reload the file every time that it is modified, because all that has to be done is to drop the Excel workbook into the required directory. Moreover, there are a few tricks that you might find useful when dealing with Excel linked servers.

  Before using a linked server, you can test the server to see if it works using the following system stored procedure:

  EXECUTE master.dbo.sp_testlinkedserver Excel;

  This returns a “Command completed successfully” if all works—and an error message if there is a problem. Unfortunately, the error messages can be somewhat cryptic, so be prepared to be patient when deciphering them.
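  If you want to run the test from a script rather than interactively, one option is to wrap it in TRY...CATCH. This is a sketch that assumes the linked server is named Excel, as defined in the solution; the printed messages are illustrative wording only.

```sql
BEGIN TRY
    EXECUTE master.dbo.sp_testlinkedserver N'Excel';
    PRINT 'Linked server Excel responded.';
END TRY
BEGIN CATCH
    -- ERROR_NUMBER() and ERROR_MESSAGE() return the details of the failure.
    PRINT 'Test failed: ' + CAST(ERROR_NUMBER() AS VARCHAR(10))
          + ' - ' + ERROR_MESSAGE();
END CATCH;
```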

  To alter the connection to an Excel linked server, you are, in most cases, better off dropping the old linked server and re-creating it. The following is the code to drop the linked server:

  IF EXISTS (SELECT name FROM sys.servers WHERE server_id != 0 AND name = 'Excel')
  EXECUTE master.dbo.sp_dropserver @server = 'Excel';

  To list the available worksheets and named ranges for an Excel linked server, use the following system stored procedure:

  EXECUTE master.dbo.sp_tables_ex EXCEL;

  For a more visual representation of the data ranges available via your linked server, you can use SQL Server Management Studio. All you have to do is expand Server Objects ➤ Linked Servers ➤ (Server Name) ➤ Catalogs ➤ Default ➤ Tables, as shown in Figure 1-11.

  Figure 1-11. Excel linked server tables

  To load data into a destination table, you can use both INSERT INTO...SELECT and SELECT...INTO, as you would expect for what is, after all, standard T-SQL. A linked server assumes that the dataset contains a header row. However, an Excel linked server (unlike OPENDATASOURCE) accepts named ranges without a header row. Should this be the case in your data set, then be sure to add HDR = NO in the @PROVSTR argument of the sp_addlinkedserver command. This sets the column names to F1, F2, and so forth. So if some of your source ranges have header rows and others do not, you will have to set up two linked servers, one with HDR = NO and the other with HDR = YES. Also, you need to be aware that a linked server to an Excel spreadsheet is extremely slow, and that if you are reusing the data in your ETL process, then loading it into a staging table is probably a lot faster overall.
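  As a sketch of the two-server arrangement, the following defines a second linked server for header-less ranges. The server name ExcelNoHeader is hypothetical, and the query assumes that the workbook contains a named range called TinyRange with no header row whose first two columns hold the inventory number and vehicle type.

```sql
-- Hypothetical second linked server, used only for ranges without headers.
EXECUTE master.dbo.sp_addlinkedserver
     @SERVER = 'ExcelNoHeader'
    ,@SRVPRODUCT = 'ACE 12.0'
    ,@PROVIDER = 'Microsoft.ACE.OLEDB.12.0'
    ,@DATASRC = 'C:\SQL2012DIRecipes\CH01\CarSales.xlsx'
    ,@PROVSTR = 'Excel 12.0;HDR = NO';

-- Columns arrive as F1, F2, and so on, so alias them in the T-SQL.
SELECT F1 AS InventoryNumber, F2 AS VehicleType
FROM ExcelNoHeader...TinyRange;
```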

  Querying the data uses a standard T-SQL SELECT query, and you can restrict the selection using specified column names (or F1, F2, and so forth, if there is no header row), a WHERE clause, ORDER BY, and so on. This means that you can also use CAST and CONVERT to change data types, and all the usual text functions (LTRIM, RTRIM, and LEFT spring to mind) to apply elementary data manipulation to text fields. As I gave plenty of examples of this in Recipes 1-4 and 1-5, I refer you back to those recipes for more details on this.
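  By way of illustration, here is a minimal sketch that loads a staging table through the linked server while tidying the data. The table dbo.StockStaging and its column types are assumptions made for this example; only the Excel linked server and the ID and Marque columns come from the recipe.

```sql
-- Hypothetical staging table; adjust the types to suit your data.
CREATE TABLE dbo.StockStaging
(
    InventoryNumber INT NULL,
    VehicleType     NVARCHAR(20) NULL
);

-- Convert and trim the Excel data on the way in.
INSERT INTO dbo.StockStaging (InventoryNumber, VehicleType)
SELECT CAST(ID AS INT)
      ,LEFT(LTRIM(RTRIM(Marque)), 20)
FROM Excel...Stock$
WHERE Marque IS NOT NULL;
```

  Loading once into a staging table like this also sidesteps the slow repeated reads against the spreadsheet.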

  Hints, Tips, and Traps

  • Be sure to set the @PROVIDER argument to the ACE or Jet provider name. You also have to set the @PROVSTR argument to Excel 8.0 (for Jet) or Excel 12.0 (for ACE).
  • The @SRVPRODUCT argument is purely decorative.
  • The Excel file need not exist when the linked server is defined.
  • You can see the linked server by expanding Server Objects/Linked Servers in SSMS.
  • Double-click the linked server name in SSMS to view the properties which you set using the sp_addlinkedserver command.