Analysis Services Distinct Count Optimization Using Solid State Devices: A Real Practices Case Study at Xbox LIVE
Authors: Joseph Szymanski, Tyson Solberg, Denny Lee
Technical Reviewers: Lindsey Allen, Akshai Mirchandani, Heidi Steen
Executive Summary
To expand on the techniques provided in the Analysis Services Distinct Count Optimization white paper, this technical note shows how using solid state devices (SSDs) can improve the performance of distinct count measures. We recount the experiences of the Microsoft Entertainment and Devices Data Warehousing Team (known for Xbox, Xbox LIVE, XNA, and Zune) in applying SSDs to a real-world, distinct-count-heavy Microsoft SQL Server Analysis Services customer environment. The key conclusion is that enterprise SSD devices, when combined with a well-optimized Analysis Services MOLAP cube, drastically improve the performance and scalability of the cube for queries against distinct count measures. They can also improve non-distinct count measures whose calculations rely heavily on the Storage Engine.

Purpose
Analysis Services distinct count measures are extremely expensive in every aspect of an Analysis Services solution: long processing times, long query durations, and large storage space requirements. Often the best approach is to convince the analysts using your cubes to use alternate measures or calculations. However, in many cases distinct count-based Key Performance Indicators (KPIs) are key components of business analytics systems. In such cases, the focus has to move from "Are you sure you need distinct count?" to "How can we make distinct count queries fast(er)?"

One way to improve the performance of distinct count measures is to change the business problem that they attempt to measure (for example, limiting the time range for distinct counts to specific years or months rather than all years). However, when you have exhausted your ability to reduce the solution’s complexity by simplifying business processes, you can turn to other techniques to improve your solution. This paper shares several techniques that we came across in real-world enterprise usage scenarios.
As originally described in Analysis Services Distinct Count Optimization, distinct count calculations are Storage-Engine-heavy, resulting in heavy disk I/O utilization (that is, higher latencies and slower throughput). This leads us to the first and foremost among those techniques: the use of solid state devices (SSDs), which have taken the world by storm. Given the high initial cost, relative newness, and lower capacity per device, it is often difficult to find a good fit for them in the enterprise. However, Analysis Services dovetails well with the SSD story because SSDs offer ultra-high read speeds for both random and sequential I/O. Analysis Services queries generally access a large amount of relatively static data and SSDs provide the speed and flexibility required. For example:
- Analysis Services scenarios are well suited for SSDs because most are designed for fast read performance.
- The biggest benefit that SSDs offer over physical disks is that they provide random I/O read speed that is nearly as high as sequential I/O read speed – orders of magnitude faster than spin disks.
- Data stored in MOLAP cubes is an ideal target for SSDs: most SSDs are rather low in capacity, but Analysis Services MOLAP cubes are generally small in size when compared to their data warehouse source. For example, in our sample case we have a 10-terabyte warehouse, which would cost an extraordinary amount of money to move to SSDs, but a 160 GB cube, which would be very easy and inexpensive to move to SSDs.
- While the initial costs of SSDs are higher than those of spin disks, the overall lifetime costs of SSDs are comparable to spin disks because of cooling costs, differences in power consumption, and general maintenance costs associated with your storage (SSDs typically have lower maintenance costs).
Scenario
This paper covers the business scenario in which business users seek to improve the performance of their Analysis Services distinct count operations. As an example, the Microsoft SQLCAT team worked with the Microsoft Entertainment and Devices Data Warehousing Team, who needed to build their critical KPIs based on distinct counts. We decided that a significant research effort to make their distinct count queries run faster through design methodologies and the use of SSDs was worthwhile.

The engineering team, examining the scenario at hand, set out to answer the following questions:
- Can distinct counts be made to run faster:
  - Given a faster I/O system?
  - Given better cube design?
- Will such improvements be cost-effective?
- Will such improvements improve or hinder scalability, compared with the current design?
Datasets
The database is a real-world production dataset with the following characteristics.

Dataset | Size
MOLAP Cube | 120 GB
SQL DW | 10.0 terabytes

The cube design for the distinct counts is relatively simple, as shown in Figure 1.

Figure 1: Usage Cube Schema
OLAP Query Characteristics
One of the primary business questions that needed an answer was: "How many distinct users used the service for <a time period>?" To get this answer, we focused on a single simple MDX statement and sought to make it run as fast as possible from a cold cache (that is, we cleared all existing caches to force the data to load from the pertinent device):

SELECT [Measures].[Online Unique Users] ON 0,
    [Date].[Date].[Date] ON 1
FROM [Usage]
WHERE
    [Date].[Calendar Month].[Calendar Month].&[2009]&[1]

The measure [Online Unique Users] is a distinct count measure that, for the selected month, scans through 300 million records in the MOLAP cube dataset.
Test Hardware
We utilized a number of servers to perform our query tests.

User Acceptance Testing Server
HP DL580 G5 4-socket, quad-core, 2.4 GHz, 64 GB RAM server connected to a high-end SAN array; Chart Label: "UAT SAN"
Development Server
Dell R710 2-socket, quad-core, 2.93 GHz, 72 GB RAM
Disk Arrays
- 1 Fusion-io ioDrive PCI-E SSD device; Chart Label: "Dev SSD"
- Two Dell MD1000 enclosures (16 x 750 GB 7200RPM drives); Chart Label: "Dev Hard Drives"
SQL Customer Advisory Team Server
Dell R905 4-socket, quad-core AMD server with 4 Fusion-io ioDrive PCI-E SSD devices; Chart Label: "SQLCAT SSD"
We’d like to thank Fusion-io, Dell, and Hewlett-Packard for the use of and support with their hardware.
Analysis
As originally described in Analysis Services Distinct Count Optimization, distinct count calculations are Storage-Engine-heavy, resulting in heavy disk I/O utilization. Therefore, the original hypothesis for this case study was modeled after the reasonable argument: "If we are disk I/O bound, and Analysis Services provides a 100 percent random read load, SSDs should drastically improve performance."

Initial Query Performance Comparison Between SSDs and Spin Disks
Let’s start by comparing the query performance between SSDs and spin disks (local hard drives and SAN).
Figure 2: Initial Performance Optimization (Lower Is Better)
As can be seen in Figure 2, the use of Fusion-io SSDs resulted in dramatically faster query performance when compared to local drives or SAN drives. For example, for the seven-month query, SSDs were 5.5 times faster than the SAN hardware.
But as with any successful BI performance improvement, Analysis Services users also asked more resource-intensive questions such as: "For the last year, how many distinct people did <X>?" That is, these questions resulted in distinct count queries that covered a year’s worth of data across a substantially larger dataset. In an attempt to find further performance improvements, the engineering team profiled the cube during query time, in production, and found a very odd thing: Disk usage and CPU usage were nowhere near the limits of what the system was able to handle.
This finding was counterintuitive, because all tests were run on a system that was dedicated to this exercise (that is, the system had no other users or processes running concurrently); it was not a matter of resource contention. We then worked with the SQLCAT team and other experts to find out why Analysis Services was not using more system resources to query the cube, starting with an investigation of how the partitions were defined.
The Move to an Optimally Parallel Cube
As noted in Analysis Services Distinct Count Optimization, partitioning significantly improves distinct count query performance. By creating distinct buckets based on distinct value and time, you can significantly improve distinct count query performance by forcing the Analysis Services Storage Engine to fire off many more threads – one for each partition – and therefore more quickly calculate the distinct value. But if partitions are designed with an uneven number of distinct values (such as in Figure 3), the query may ultimately become single-threaded (even though all four partitions are being queried) because the Analysis Services Storage Engine is waiting for the largest partition (data file with values from 1,500 to 20,000) to complete its calculations. This behavior explained the puzzling results around disk and CPU consumption in the earlier tests.
Figure 3: Unevenly Distributed Partitions
A more evenly distributed set of distinct values within the partitions results in all four threads completing at approximately the same time, resulting in minimal spinning and wait time while the calculations are completed. Clearly, parallel queries across the partitions had to be part of our solution.

Figure 4: Evenly Distributed Partitions
Figure 4 shows the even distribution of data among the partitions, which is a key concept for distinct count query optimization. The Analysis Services Storage Engine will initially query the header file to determine which data partitions to query for the range of distinct count values. This way, the storage engine queries only partitions that have the values required to complete the calculation.
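To make "evenly distributed, continuous, non-overlapping" concrete, here is a minimal T-SQL sketch (not part of the original case study) that derives roughly equal-sized userid ranges for one month; those ranges could then serve as partition boundaries. The table name dbo.FactUsage, the column names userid and CalendarMonthKey, and the bucket count are all hypothetical assumptions.

-- Hypothetical sketch: derive 16 continuous, roughly equal-sized userid ranges
-- for one month; each range becomes the BETWEEN boundaries of one sub-partition.
DECLARE @MonthKey int = 200901;  -- January 2009 (assumed surrogate key format)
DECLARE @Buckets  int = 16;      -- for example, the number of physical CPU cores

WITH Bucketed AS
(
    SELECT userid,
           NTILE(@Buckets) OVER (ORDER BY userid) AS BucketNo
    FROM   dbo.FactUsage
    WHERE  CalendarMonthKey = @MonthKey
)
SELECT BucketNo,
       MIN(userid) AS RangeStart,   -- lower BETWEEN boundary for this sub-partition
       MAX(userid) AS RangeEnd,     -- upper BETWEEN boundary for this sub-partition
       COUNT(*)    AS RowsInBucket  -- should be roughly equal across buckets
FROM   Bucketed
GROUP BY BucketNo
ORDER BY BucketNo;

-- Note: if a single userid straddles two buckets, adjust the boundaries so the
-- ranges stay non-overlapping.

A boundary query like this would typically be run once per time grain, with the resulting ranges plugged into the partition source queries.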
After extensive testing, we rediscovered some important rules and added them to the distinct count partitioning strategy to ensure all distinct count queries are optimally parallel:
The distinct count measure must be directly contained in the query.
If you partition your cube by the hash of a UserID distinct value, it is important that your query perform a distinct count of the hash of the UserID, not a distinct count of the UserID itself. For fast distinct count query performance, it is important for the distinct count value itself to be placed in its own periodic partition (for example, all of User 1's rows for month 1 land only in month 1, partition 1; all of User 100's rows in month 1, partition 2; and so on) and for the values to be non-overlapping (for example, Users 1-99 in month 1, partition 1; Users 100-199 in month 1, partition 2; and so on). Hashing causes records in the same range to be distributed across multiple partitions, which destroys the non-overlapping behavior. Even if the UserID and the hash of the UserID have the same distribution of data, and even if you partition the data by the latter, the header files contain only the range of values associated with the hash of the UserID. This ultimately means that the Analysis Services Storage Engine must query all of the partitions to perform the distinct count on the UserID. For more information about these concepts, see the white paper Analysis Services Distinct Count Optimization.
The distinct count values need to be continuous.
As implied in Figures 3 and 4, each partition holds a continuous range of values, with the partitions together covering the values from 100 to 20,000 in this example. Based on the empirical evidence we gathered in our testing for this case, distinct count query performance appears to improve when the values within each partition are continuous.
After we followed these two rules, we were easily able to improve query parallelism with very few changes.
More specifically, we analyzed our data size, selected a month as the coarse time grain for the distinct count measure group partitions, and then sub-selected the data, per month, into <n> partitions, where n is the number of physical CPU cores on the OLAP hardware. We made this decision after we identified a number of options, tested them, and found this particular grain to be the best for our set of data. Other than the partition changes, the cube design stayed the same, and we did not alter any aggregations for this cube. Note that we had already followed the established guidelines of the SQLCAT white paper Analysis Services Distinct Count Optimization.
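As an illustration of that design, the following is a hedged T-SQL sketch of what a single partition's source query might look like: sliced first by the month grain and then by a continuous, non-overlapping userid range within that month. The table name, column names, and boundary values are placeholders, not the production schema used in this case study.

-- Hypothetical partition source query: calendar month 2009-01, sub-partition 1 of n.
-- Each sub-partition of the same month uses a different, non-overlapping userid
-- range, so every partition's header file holds a disjoint range of distinct values.
SELECT userid,
       CalendarMonthKey
       -- plus the other measure group columns
FROM   dbo.FactUsage
WHERE  CalendarMonthKey = 200901
  AND  userid BETWEEN 1 AND 99;   -- next sub-partition: userid BETWEEN 100 AND 199, and so on

The BETWEEN boundaries could come from a boundary query such as the sketch shown earlier.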
Note: To allow for more repeatable distinct count query comparison, the cube used here contained no aggregations on distinct count measures.
The following table lists various combinations of measure group slicer queries and distinct count measures, and notes whether each combination is optimally parallel.
SQL Query WHERE clause | Analysis Services distinct count member | Is the query optimally parallel? If not, why? |
WHERE userid % 16 = 0 | userid | NO: Query does not return a continuous dataset. |
WHERE CAST(HASHBYTES('SHA1',CAST(userid AS VARCHAR)) AS BIGINT) BETWEEN a AND b | userid | NO: The Analysis Services member "userid" is not contained directly in the query. |
WHERE userid BETWEEN a AND b | userid | YES |
WHERE userid % 16 = 0 | CAST(HASHBYTES('SHA1',CAST (userid AS varchar)) AS bigint) | NO: Query does not return a continuous dataset. |
WHERE CAST(HASHBYTES('SHA1',CAST(userid AS VARCHAR)) AS BIGINT) BETWEEN a AND b | CAST(HASHBYTES('SHA1',CAST (userid AS varchar)) AS bigint) | YES |
WHERE userid BETWEEN a AND b | CAST(HASHBYTES('SHA1',CAST (userid AS varchar)) AS bigint) | NO: The Analysis Services member <Hash of Userid> is not directly in the query. |
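For the hash-based rows in the table above, one way to keep the partition slicer and the distinct count measure bound to exactly the same expression is to materialize the hash as a column in the relational source. The following is only a sketch of that idea; the table name is hypothetical, and this is not necessarily how the case-study schema was built.

-- Hypothetical: persist the hashed userid so the partition WHERE clause and the
-- Analysis Services distinct count measure both reference the same column.
ALTER TABLE dbo.FactUsage
ADD UserIdHash AS
    CAST(HASHBYTES('SHA1', CAST(userid AS varchar(20))) AS bigint) PERSISTED;

-- Each partition source query and the distinct count measure then use UserIdHash:
-- WHERE CalendarMonthKey = 200901 AND UserIdHash BETWEEN @a AND @b

With the hash persisted, the partition queries can use simple BETWEEN ranges over UserIdHash while the cube's distinct count measure counts the same column, satisfying both rules above.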
Now That We Have an Optimally Parallel Cube…
The results were stunning, as shown by the "V2" lines for both SSDs and hard disk drives (HDDs) (where "V2" is the version 2 cube, which follows the optimizations discussed earlier in this paper).
Figure 5: Performance after enabling "multi-threaded mode".
(Note that in Figure 5, the V2 cube performed at the same speed, on SSDs, in all environments. We show only the SQLCAT line for simplicity.)
The conclusion is that by adding enough I/O capacity (through SSDs) that I/O was no longer a bottleneck, we were able to find and resolve the algorithmic issues, enabling incredible performance. The key point is that we never would have found these algorithmic issues without first removing the I/O channel bottlenecks through the use of SSDs.
But Wait, Processes Get in the Way!
At this point, due to operational issues, we were initially unable to go live with the SSD servers. It was agreed that given these results, we should go with the easier-to-implement DAS HDD solution, which offered similar performance. Specifically, our challenge to going live was that, due to a lack of SSD enterprise standards as of mid-2009, the supportability story was too complicated to be sustainable across a large number of servers.

In retrospect, not waiting for SSDs was a serious mistake. The cube went live and user satisfaction plummeted. What went wrong? The results shown in Figure 5 were correct, but somehow, performance was awful.
Parallel User Load
Figure 5, though accurate, shows query response times for a single user only. It does show that, with a solidly parallel cube, good DAS can be nearly as fast as SSDs for reasonably large MOLAP cubes. But a more in-depth analysis revealed that our initial analysis failed to consider a highly parallel user load and did not benchmark a sufficiently large multi-user parallel Analysis Services query load.

Had this been done sooner, the patterns illustrated in Figures 6 and 7 would have been found.

Figure 6: Distinct count querying comparison of different access patterns

Figure 7: Distinct count query comparison of different access patterns (time to respond to first query)
To execute the parallelization scenario, the engineering team used a tool to execute multiple Analysis Services queries in parallel, with the following characteristics:
- Each query selected a distinct dataset – no overlaps between data.
- Each query was run in two concurrency modes: all at the same time, and with 30 seconds between queries, to simulate a real user load.
- Three access patterns were selected and timed: Running six queries serially, running six queries concurrently, and running twelve queries concurrently.
- Each test run was executed multiple times, clearing caches between runs; the times indicated are averages.
- Two measures were recorded: the time for the first submitted query to complete, and the total time for all queries to complete.
From a technical perspective, SSDs allow many more threads of execution to run in parallel without incurring huge I/O wait times, because their random I/O throughput is essentially the same as their sequential I/O throughput. This benefit is relevant because multiple independent queries, serviced simultaneously, implicitly cause random I/O at the disk level, and unlike rotating disks, SSD devices do not slow down under random I/O. While rotating disks drop to a three-digit number of I/O operations per second when access is random, high-end SSD devices continue to deliver a five-digit number of I/O operations per second, sequential or random. This directly translates into more parallel queries, and therefore more concurrent users, per server when its I/O system is based on high-end SSD technology.
Recommendations
Here is a list of conclusions we drew from the work we did for the Microsoft Entertainment and Devices Data Warehousing Team. Most have been discussed in this paper, but some are general best practices we want to share with you:

- Remove I/O bottlenecks by adding fast enough underlying disk I/O. Their absence makes it easier to find algorithmic bottlenecks in SQL Server Analysis Services implementations.
- For distinct counts (and other Storage-Engine-heavy query loads), the Analysis Services workload is well suited to an SSD I/O back end.
- When you evaluate changes to an Analysis Services cube, testing single-user query performance is not enough. If you do not create an independently parallel load, you are not properly simulating usage patterns, because your users are creating parallel loads.
- It is critical to be aware of your production workload: monitor the queries being run and the performance of the system servicing them.
- Even a simple query can stress Analysis Services distinct count performance – it is critical to consider the size of the dataset that a distinct count query returns to accurately assess the query’s performance.
- Follow these rules for making sure Analysis Services can parallelize distinct count queries, in addition to the existing standards and practices for partitioning:
- Make sure that the distinct count attribute is directly used in the partitioning query.
- Make sure that the partitioning query function (for all sub-partitions in a single time period) is continuous. Using a hash function and BETWEEN is one way to do this that works well.
- When benchmarking, if you are testing cold-cache scenarios, be sure that you run multiple times and clear all caches between runs. Don’t accept a result as true until you can reproduce it.
Summary
The results of our analysis are clear: SSD technology has significant benefits for MOLAP-based Analysis Services solutions. Because concurrent users implicitly create random I/O patterns, solid-state devices enable greater scalability and per-user performance. In the past, before the advent of SSD technology, getting very high-end parallel random I/O performance required a complex and very expensive solution. SSDs offer these benefits without the prohibitively high cost.

http://sqlcat.com/technicalnotes/archive/2010/09/20/analysis-services-distinct-count-optimization-using-solid-state-devices.aspx