Michael Barrett's profileBI by Michael BarrettBlog Tools Help

Michael Barrett

Occupation
Location

BI by Michael Barrett

January 31

Analysis Services 2005 performance guide

The Analysis Services 2005 performance guide has been released. Get it here: Analysis Services 2005 performance guide. Note that many of the suggestions in the paper only apply if you have installed SQL Server 2005 Service Pack 2 (yet to be released).

December 12

SSIS: A distinct component

As a follow up to Jamie Thomsons post on "SSIS: A distinct component please" (found here) I thought I would share how I have implemented a distinct component using the script component in the SSIS pipeline.

The script component is - of course - set up to be asynchronous by modifying the synchronous input ID of the output from the script component. It will only be partially-blocking, however, as it does not need to wait for all rows to be passed in before the output is created. A new row is added to the output every time a new key value is encountered in the input. This is contrary to the distinct functionality of the built-in Sort component, where you have to wait for all rows to be passed in, before the distinct values are sent down the output.

The implementation uses a hashtable to keep track of the distinct values, and to reproduce it you can do the following:

1. Create a new SSIS package
2. Create an OLE DB Connection manager that points to the AdventureworksDW database.
3. Create a data flow task in the package
4. In the dataflow, create an OLE DB Source with the source SQL statement:

SELECT Color
FROM DimProduct

5. Tie the output of the OLE DB Source to the input of a new script component.
6. In the script component under "Input" select the Color attribute
7. Under "Inputs and outputs" select the output and modify the synchronous input ID to something other than the default value (this makes the output behave asynchronously).
8. Create a new output column called Color - remember to adjust the data type to match the data type of the input column.
9. Add the following code to your script.

Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Collections

Public Class ScriptMain

Inherits UserComponent
Dim htDistinct As Hashtable = New Hashtable

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

  If Not htDistinct.ContainsKey(Row.Color) Then
    htDistinct.Add(Row.Color, DBNull.Value)
    Output0Buffer.AddRow()
    Output0Buffer.Color = Row.Color
  End If

End Sub

End Class

10. Optionally, connect the script component to a row count transformation and add a data viewer to the path between the two components.
11. Run the package and watch how only distinct values are output from the script component.

This example illustrates how to make a distinct operation on a single column in the data flow, but it can be easily extended to making a distinct operation on more than one column.

July 05

MDX: Subcubes should set the current query context

Currently, subcubes/subselects used in MDX queries in Analysis Services 2005 do not set the current query context. This is a huge problem in calculated members that need the current context in order to return correct results.

For examples and an elaboration, please see:
http://www.sqljunkies.com/WebLog/reckless/archive/2006/03/08/18601.aspx
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=282896&SiteID=1
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=488496&SiteID=1

Excel 2007 makes extensive use of subselects/subcubes for restricting query results, so the problem will appear in this client application with the current behavior of AS 2005.

If you agree with me, please vote for my suggestion at: http://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=156581

May 01

SSIS: Handling large data flows

Recently I had an experience while developing an SSIS package that I would like to share. The package had a simple control flow containing 5 Execute SQL Tasks, one Send Mail Task and a Data Flow Task. Simple as this sounds, the data flow was really large. Some of the components that it contained were:

* 3 OLE DB sources
* 4 OLE DB destinations
* Approx. 20 Lookup components
* Approx. 20 Conditional Split components
* Approx. 20 Union All transformations
* Approx. 30 Derived Column transformations
* Approx. 20 Script components

I should mention that the data flow was used for loading fact data, and the components were (primarily) used to assign surrogate keys and inserting inferred members in dimension tables using the "Project Real Method".

The problem with the package was that when I was developing it, the UI became increasingly slow to work with, when I was editing the data flow. In the end it was almost impossible to work with the data flow, as I had to wait 15-20 seconds for the UI to react every time I clicked the mouse. Also, opening the package took a good 10-15 minutes - and this was also the case when DTEXEC (not DTSDebugHost) had to execute the package. There was simply a 10-15 minute delay before the package started executing!

Now, what could I do? I needed the functionality in the package, but I could not live with the slow (impossible) editing of the package, and the delay when executing the package was also unacceptable. I came up with a solution that is almost as good as if I had been able to keep the logic in one data flow. I simply "cut the data flow in half", creating two data flows. The first data flow wrote its output to a raw file and the next data flow simply picked up data from this file and continued its execution.

It's not a perfect solution, but it solved the problem, and if you have a large data flow that for some reason becomes very slow to work with, perhaps you could use this approach as well...

I think what I have encountered could be described as a bug. I haven't filed it with MS, though.

March 08

MDX: EXISTING operator and sub-selects

This is the first post in my blog, which will be focusing on BI-related issues on the SQL Server platform - i.e.: Analysis Services, Integration Services and Reporting Services.

This blog entry is about the use of the EXISTING operator in MDX calculations (well, really it is about one of the pitfalls of using EXISTING). The operator is new to Analysis Services 2005 and can be used to filter a set by the current coordinate in the cube. This is very useful and can for instance be used for making MDX calculations "multi-select friendly" (for more details on this, see Mosha's blog post here: http://www.sqljunkies.com/WebLog/mosha/archive/2005/11/18/multiselect_friendly_mdx.aspx). The following MDX query (running on the Adventure Works cube database) illustrates the use of EXISTING.

WITH MEMBER [Measures].[MyDate] AS
Tail(EXISTING [Date].[Calendar].[Date], 1).item(0).membervalue
SELECT {[Measures].[MyDate]} ON 0
FROM [Adventure Works]
WHERE [Date].[Calendar Year].&[2003]

This query will return the last date in calendar year 2003, which of course is december 31 2003. It does this because the WHERE clause slices on calendar year 2003 and the last date to exist with this is december 31 2003.

What happens if we rewrite the query a bit to use a subselect to filter on calendar year 2003 instead of specifying this in the WHERE clause?

WITH MEMBER [Measures].[MyDate] AS
Tail(EXISTING [Date].[Calendar].[Date], 1).item(0).membervalue
SELECT {[Measures].[MyDate]} ON 0
FROM
(SELECT [Date].[Calendar Year].&[2003] ON 0
FROM [Adventure Works])

Now, I might have been the only one, but I certainly expected this query to return the same date as the first one: December 31 2003. This is not the case. Our new, rewritten, query will return August 31 2004 (which is the last available date in the date dimension in the Adventure Works cube database). What this really means is that our subselect does not set the current coordinate in the cube (in Moshas words it "merely does top level Exists with axis and applies visual totals"). This is really important to remember if you ever want to use the EXISTING operator for some fancy MDX calculations, because if your client tools use subselects (which, for example, Excel 12 does extensively I am told), you might run into some unexpected results.

Now, it is not all bad - if our query for instance sets the current coordinate on the date dimension by using one of the hierarchies from this dimension on an axis, we will get our expected result.

WITH MEMBER [Measures].[MyDate] AS
Tail(EXISTING [Date].[Calendar].[Date], 1).item(0).membervalue
SELECT {[Measures].[MyDate]} ON 0,
[Date].[Calendar Year].[Calendar Year] ON 1
FROM
(SELECT [Date].[Calendar Year].&[2003] ON 0
FROM [Adventure Works])

UPDATE 20060610: The problems with subselects mentioned above are not limited to the use of the EXISTING operator. They apply in all situations where the current context is needed in an MDX expression in order to return a correct result.