An Introduction to Data Warehousing
By Vivek R. Gupta, Senior Consultant
This white paper introduces data warehousing concepts.
Data warehousing has quickly evolved into a unique and popular business
application class. Early builders of data warehouses already consider their systems
to be key components of their IT strategy and architecture. Numerous examples can
be cited of highly successful data warehouses developed and deployed for businesses
of all sizes and all types. Hardware and software vendors have quickly developed
products and services that specifically target the data warehousing market. This
paper introduces the key concepts surrounding data warehousing systems.
What is a data warehouse? A simple answer is that a data warehouse
is managed data situated after and outside the operational systems. A complete definition
requires discussion of many key attributes of a data warehouse system. Section 2
identifies these key attributes and discusses the definition they provide for a data
warehouse. Section 3 briefly reviews the activity against a data warehouse system.
First, however, Section 1 takes a brief tour of the traditional management of data
after it passes through the operational systems and of the types of analysis generated
from this historical data.
Evolution of an application class
This section reviews the historical management of the analysis data and
the factors that have led to the evolution of the data warehousing application class.
Traditional approaches to historical data
In reviewing the development of data warehousing, we need to begin with
what was done with this data before data warehouses evolved. Let us first look at
how the kind of data that ends up in today's data warehouses was managed historically.
Throughout the history of systems development, the primary emphasis has
been given to the operational systems and the data they process. It is not practical
to keep data in the operational systems indefinitely, and only as an afterthought
was a structure designed for archiving the data that the operational system has
processed. Yet the fundamental requirements of the operational and analysis systems
are different: the operational systems need performance, whereas the analysis systems
need flexibility and broad scope. It has rarely been acceptable to have business
analysis interfere with and degrade the performance of the operational systems.
Data from legacy systems
In the 1970s, virtually all business system development was done on
IBM mainframe computers using tools such as COBOL, CICS, IMS, and DB2. The 1980s
brought new mini-computer platforms such as the AS/400 and VAX/VMS. The late
eighties and early nineties made UNIX a popular server platform with the introduction
of the client/server architecture.
Despite all the changes in the platforms, architectures, tools, and technologies,
a remarkably large number of business applications continue to run in the mainframe
environment of the 1970s. By some estimates, more than 70 percent of business data
for large corporations still resides in the mainframe environment. There are many
reasons for this. The most important reason, and one that is particularly relevant
to our topic, is that over the years these systems have grown to capture the business
knowledge and rules that are incredibly difficult to carry to a new platform or
application.
These systems, generically called legacy systems, continue to be the
largest source of data for analysis systems. The data that is stored in DB2, IMS,
VSAM, etc. for the transaction systems ends up in large tape libraries in remote
data centers. An institution will generate countless reports and extracts over the
years, each designed to extract requisite information out of the legacy systems.
In most instances, IS/IT groups assume responsibility for designing and developing
programs for these reports and extracts. The time required to generate and deploy
these programs frequently turns out to be longer than the end users think they can
afford.
Extracted information on the Desktop
During the past decade, the sharply increasing popularity of the personal
computer on business desktops has introduced many new options and compelling opportunities
for business analysis. The gap between the programmer and end user has started to
close, as business analysts now have at their fingertips the spreadsheet tools required
for analysis and graphic representation.
Advanced users will frequently use desktop database programs that allow them to
store and work with the information extracted from the legacy sources. Many desktop
reporting and analysis tools are increasingly targeted towards end users and have
gained considerable popularity on the desktop.
The downside of this model for business analysis is that it leaves the
data fragmented and oriented towards very specific needs. Each individual user has
obtained only the information that he or she requires. Not being standardized, the
extracts are unable to address the requirements of multiple users and uses. The
time and cost involved in addressing the requirements of only one user prove prohibitive.
This approach to data management assumes the end user has the time to expend on
managing the data in the spreadsheets, files, and databases. While many of these
users may be proficient at data management, most undertake these tasks as a necessity.
And given the choice, most users would find it more efficient to focus on the actual
analysis and the tools available to them.
Decision-Support and Executive Information Systems
Another category of popular analysis systems has been decision support
systems and executive information systems. Decision support systems tend to focus
more on detail and are targeted towards lower- to mid-level managers. Executive information
systems have generally provided a higher level of consolidation and a multi-dimensional
view of the data, as high-level executives need the ability to slice and dice the
same data more than the ability to drill down to review the detail.
These two similar and overlapping categories are perhaps the closest
precursors to the data warehousing systems. Yet the high price of their development
and the coordination required for their production made them an elite product that
never entered the mainstream. The following are some characteristics generally associated
with decision support or executive information systems:
- These systems have data in descriptive standard business terms, rather
than in cryptic computer field names. Data names and data structures in these systems
are designed for use by non-technical users.
- The data is generally preprocessed with the application of standard
business rules such as how to allocate revenue to products, business units, and
markets.
- Consolidated views of the data such as product, customer, and market
are available. Although these systems will at times have the ability to drill down
to the detail data, rarely are they able to access all the detail data at the same
time.
Today's data warehousing systems provide the analytical tools afforded
by their precursors. But their design is no longer derived from the specific requirements
of analysts or executives; and, as we will see later, data warehousing systems are
most successful when their design aligns with the overall business structure rather
than specific requirements.
Emergence of key enabling technologies
Many factors have influenced the quick evolution of the data warehousing
discipline. The most significant set of factors has been the enormous forward movement
in the hardware and software technologies. Sharply decreasing prices and the increasing
power of computer hardware, coupled with ease of use of today's software, has made
possible quick analysis of hundreds of gigabytes of information and business knowledge.
Hardware prices plummeting according to Moore's law
The most important factor in the evolution of data warehousing has been
the sharply increasing power of computer hardware. Along with the increase in this
power, their prices have fallen just as sharply. Gordon Moore, co-founder of Intel,
predicted that the capacity of a microprocessor would double every 18 months. This
has not only held true for the processor but also for other components of the computer.
While desktop computers today are more powerful than the mainframes of yesterday,
an inexpensive server possesses power that was difficult to imagine just a decade
ago.
The Pentium II and Alpha processors have brought incredible power to
the commodity computer market. Sophisticated processor architectures such as symmetric
multi-processing have come to mainstream computing on inexpensive machines. Higher-capacity
memory chips, a key component influencing the performance of a data warehouse system,
are now available at very low prices. It is now possible to have a moderately priced
machine with 1 or 2 gigabytes of memory. Computer buses such as PCI and controller
interfaces such as Ultra SCSI have made I/O incredibly fast. Last but not least,
the disk drive has shrunk in size while holding amazing amounts of information. Just
two decades ago, it would have taken a roomful of disk drives to store information
that can now easily be stored on a single one-inch-high disk drive.
Desktop power increasing
Entering the market as a novelty computer in the early eighties, the personal
computer has become the hotbed of innovation during the past decade. The personal
computer was initially used for word processing and other minor tasks with no links
to primary analytical functions. With the help of innovations such as powerful personal
productivity software, easy-to-use graphical interfaces, and responsive business
applications, the personal computer has become the focal point of all computing
today. Powerful desktop hardware and software have allowed for the development of
the client/server and multi-tier computing architectures. Almost all data warehouses
are accessed by personal computer based tools. These tools vary from the simple
query capabilities available with most productivity packages to incredibly powerful
graphical multi-dimensional analysis tools. Without this wide array of choices
for data warehouse access, data warehousing would not have evolved so quickly.
Ever-increasing power of server software
Server operating systems such as Windows NT and Unix have brought mission-critical
stability and powerful features to the distributed computing environment. Operating
system software has become feature-rich and powerful even as its cost has steadily
declined. With this combination, sophisticated operating system concepts such as
virtual memory, multi-tasking, and symmetric multi-processing are now available
on inexpensive platforms. Operating systems such as Windows NT have made these
powerful systems very easy to set up and operate, reducing the total cost of
ownership of these powerful servers.
Explosion of Intranets and Web based applications
The most important development in computing since the advent of the personal
computer is the explosion of the Internet and Web-based applications. Somewhat late
to the party, the business community has quickly jumped onto the Internet bandwagon.
One of the most exciting fields in the computing industry today is the development
of Intranet applications. Intranets are private business networks based on Internet
standards but designed for internal use. The Internet/Intranet trend has very
important implications for data warehousing applications. First, data warehouses
can be made available worldwide over public or private networks at much lower cost.
This availability minimizes the need to replicate data across diverse geographical
locations. Second, these standards allow the web server to provide a middle tier
where all the heavy-duty analysis takes place before the results are presented to
the web-browsing client.
The skyrocketing power of hardware and software, along with the availability
of affordable and easy-to-use reporting and analysis tools, has played the most
important role in the evolution of data warehouses. Figure 1 highlights the technological
revolution that has greatly impacted data warehousing.
Change in the nature of the business
Another significant influence on the evolution of data warehousing has
been the fundamental change in business organization and structure during the late
eighties and early nineties. The emergence of a vibrant global economy has profoundly
changed the information demands of corporations in the United States and worldwide.
Corporations have found markets for their products globally while competing with
other companies in vastly different cultures and economic environments. Mergers
and acquisitions of businesses now cross national boundaries.
Economic factors of the recent years
The economic downturn of the late eighties led many global corporations
through a remarkable period of consolidation. Phenomena such as "business process
reengineering" and "downsizing" forced businesses to reevaluate their
business practices. Many industries went through prolonged periods of consolidation
and reinvention. During this period, simple economics forced the businesses to identify
their core competency areas and shed businesses that were not profitable.
These economic factors have played an important role in the evolution
of data warehousing. For example, when a banking unit that used different operational
systems changed hands, the top management still needed to view the consolidated
business and manage the associated risks accordingly. The banking industry has been
the leader in the use of data warehouses. Today's data warehousing systems are extensively
used for profitability and customer behavior analysis.
Global corporation
The fall of communism and liberalization of Asian and South American
economies has changed the business climate worldwide forever. Competition from emerging
economies has forced large corporations to become lean and efficient. The emergence
of this global economy has led to the migration of manufacturing to less expensive
and less restrictive countries. Former communist and South American countries present
very exciting and challenging business opportunities. Along with these opportunities
they present a very volatile business climate and economies that are nearly impossible
to predict.
Businesses have not only focused on building products worldwide, but
they have also changed their organization to sell products around the globe. Trade
agreements such as NAFTA and EEC greatly impact the decisions to enter markets or
build factories. This globalization of business has increased the need not merely
for more continuous analysis, but also for managing data in a centralized location.
The process of rolling up manufacturing and sales data from far-flung business units
now impacts a much larger number of corporations. Businesses now need
to continuously make "build or buy" decisions. Globalization of business
has made the consolidation of data in a central data warehouse more complicated.
Factors such as currency fluctuations and product customization for different markets
have added complexity to data warehousing, making the analysis much more complicated.
Imagine trying to assess profitability of products built and sold in multiple countries
with volatile currencies. Or, attempting to hedge the risks of downturn in economies
that have been expanding rapidly for extended periods.
Emergence of standard business applications
Another factor that is fast becoming an important variable in data warehousing
equations is the emergence of vendors with popular business application suites.
Led by wildly popular German software vendor SAP AG, flexible business software
suites adapted to the particulars of a business have become a very popular way to
move to a sophisticated multi-tier architecture. Other vendors such as Baan, PeopleSoft,
and Oracle have likewise come out with suites of software that provide different
strengths but have comparable functionality.
The emergence of these application suites has a direct bearing on the
increased use of data warehousing in that they are increasingly able to provide
standard applications that are replacing existing custom developed legacy applications.
In the near future, almost every data warehouse is likely to derive data from one
of these application sources rather than the customized extraction from legacy systems.
Further, there are significant initiatives at these vendors to make transaction
data easily available to data warehousing systems. To the extent that these standard
applications have extensive customization features, data acquisition from these
applications can be much simpler than from the mainframe systems.
End users more technology-savvy
One of the most important results of the massive investment in technology
and movement towards the powerful personal computer has been the evolution of a
technology-savvy business analyst. Even though technology-savvy end users are
not always an asset to every project, this trend has certainly produced a crop
of technology-leading business analysts who are becoming essential to today's business.
These technology-savvy end users have frequently played an important role in the
development and deployment of data warehouses. They have become the core users that
are first to demonstrate the initial benefits of data warehouses. These end users
are also critical to the development of the data warehouse model: as they become
experts with the data warehousing system, they play a very important role in mentoring
other users.
Word processing and spreadsheets were the first applications to be effectively
used on the personal computers. In fact, the spreadsheet is said to be the killer
application that led to widespread deployment of personal computers. The charting
functions from a spreadsheet represent one of the most extensively used business
analysis and presentation functions. The new pivot tables available in popular spreadsheets
have allowed for simple multi-dimensional analysis. The aggressive use of inexpensive
personal productivity software has led to use of more robust reporting and analysis
tools along with more powerful desktop database engines. These powerful tools are
now more targeted towards the end user and often require very little training for
simple applications.
Management more information conscious
Many factors affect the heightened awareness of trends in information
technology among mid and upper management levels. Unlike a decade ago, information
technology is now nearly universally accepted as a key strategic business asset.
Many mid- and upper-level managers who have risen through the ranks over the last
decade made their mark with successful technology investments. As a result, they
tend not to shy away from risking resources on new and emerging technologies.
The explosive use of the Internet has greatly aided managers' awareness of technology
trends. The Internet is now being used to conduct business transactions, but its
greatest asset to date has been the dissemination of information. Today, executives
can not only review various sources of industry trends, they can also readily find
case studies and vendor information.
The use of technology by mid and upper level managers has increased significantly.
They have decisively moved beyond using the personal computer for email. This hands-on
use of information and technology by upper management has facilitated the sponsorship
of larger projects such as data warehousing.
Alongside the availability of key enabling technologies, these fundamental
changes in the nature of business over the past decade have played a central role
in the evolution of the data warehouse. Some might even argue that these changes
in business have led the technology to its current state.
Data warehousing attributes and concepts
Having looked at the historical use of the analysis data and explored
some of the factors influencing the evolution of data warehouses, we will now turn
to identifying the key attributes of a data warehouse.
It is important to recognize that data warehousing is still an evolving
science. As with any evolving technology, particular care must be taken to discount
some marketing claims driven by vendors attempting to differentiate themselves from
the competitors. For example, the size of a data warehouse should not determine
whether it is really a data warehouse. Some vendors may say that a data warehouse
of only 50 gigabytes is not a full-fledged data warehouse and may refer to it instead
as a data mart. For a smaller company, 50 gigabytes or even much less can represent
every relevant piece of information covering the last 10 years and can well constitute
a powerful data warehouse.
This section explores the data warehousing concepts and attributes. These
concepts are grouped into four sub-sections. The first sub-section discusses the
reasons for separating the data for business analysis from the operational data.
The logical transformation of the data, including data warehouse modeling and de-normalization,
is introduced in the second sub-section. Sub-section three reviews
the issues associated with physical transformation of the data. Sub-section four
discusses the generation of summary views. A very simple and broad definition of
a data warehouse follows the discussion of the data warehousing concepts and attributes.
"Warehousing" data outside the operational systems
The primary concept of data warehousing is that the data stored for business
analysis can most effectively be accessed by separating it from the data in the
operational systems. Many of the reasons for this separation have evolved over the
years. In the past, legacy systems archived data onto tapes as it became inactive
and many analysis reports ran from these tapes or mirror data sources to minimize
the performance impact on the operational systems.
These reasons to separate the operational data from analysis data have
not significantly changed with the evolution of the data warehousing systems, except
that now they are considered more formally during the data warehouse building process.
Advances in technology and changes in the nature of business have made many of the
business analysis processes much more complex and sophisticated. In addition to
producing standard reports, today's data warehousing systems support very sophisticated
online analysis including multi-dimensional analysis.
Integrating data from more than one operational system
Data warehousing systems are most successful when data can be combined
from more than one operational system. When the data needs to be brought together
from more than one source application, it is natural that this integration be done
at a place independent of the source applications. Before the evolution of structured
data warehouses, analysts in many instances would combine data extracted from more
than one operational system into a single spreadsheet or a database. The data warehouse
may very effectively combine data from multiple source applications such as sales,
marketing, finance, and production. Many large data warehouse architectures allow
for the source applications to be integrated into the data warehouse incrementally.
The primary reason for combining data from multiple source applications
is the ability to cross-reference data from these applications. Nearly all data
in a typical data warehouse is built around the time dimension. Time is the primary
filtering criterion for a very large percentage of all activity against the data
warehouse. An analyst may generate queries for a given week, month, quarter, or
a year. Another popular query in many data warehousing applications is the review
of year-on-year activity. For example, one may compare sales for the first quarter
of this year with the sales for first quarter of the prior years. The time dimension
in the data warehouse also serves as a fundamental cross-referencing attribute.
For example, an analyst may attempt to assess the impact of a new marketing campaign
run during selected months by reviewing the sales during the same periods. The ability
to establish and understand the correlation between activities of different organizational
groups within a company is often cited as the single biggest advanced feature of
the data warehousing systems.
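To make the role of the time dimension concrete, here is a minimal sketch of a year-on-year query in Python with SQLite; the sales_fact table and its columns are hypothetical names invented for illustration, not part of any particular warehouse.

```python
import sqlite3

# Hypothetical sales fact table keyed by an ISO date; in a real
# warehouse the time dimension is often a separate table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        sale_date  TEXT,   -- e.g. '1998-02-14'
        product_id TEXT,
        amount     REAL
    )
""")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [("1997-01-15", "P1", 100.0), ("1997-02-20", "P1", 150.0),
     ("1998-01-10", "P1", 180.0), ("1998-03-05", "P1", 120.0)],
)

# Compare first-quarter sales across years: time is the primary
# filtering and grouping criterion.
for year, q1_sales in conn.execute("""
        SELECT strftime('%Y', sale_date) AS year, SUM(amount)
        FROM sales_fact
        WHERE CAST(strftime('%m', sale_date) AS INTEGER) <= 3
        GROUP BY year ORDER BY year"""):
    print(year, q1_sales)   # 1997 250.0, then 1998 300.0
```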
The data warehouse system can serve not only as an effective platform
to merge data from multiple current applications; it can also integrate multiple
versions of the same application. For example, an organization may have migrated
to a new standard business application that replaces an old mainframe-based, custom-developed
legacy application. The data warehouse system can serve as a very powerful and much
needed platform to combine the data from the old and the new applications. Designed
properly, the data warehouse can allow for year-on-year analysis even though the
base operational application has changed.
Differences between transaction and analysis processes
The most important reason for separating data for business analysis from
the operational data has always been the potential performance degradation on the
operational system that can result from the analysis processes. High performance
and quick response times are almost universally critical for operational systems.
The loss of efficiency and the costs incurred with slower responses on the predefined
transactions are usually easy to calculate and measure. For example, a loss of five
seconds of processing time is perhaps negligible in and of itself; but it compounds
out to considerably more time and high costs once all the other operations it impacts
are brought into the picture. On the other hand, business analysis processes in
a data warehouse are difficult to predefine and they rarely need to have rigid response
time requirements.
Operational systems are designed for acceptable performance for pre-defined
transactions. For an operational system, it is typically possible to identify the
mix of business transaction types in a given time frame including the peak loads.
It is also relatively easy to specify the maximum acceptable response time given a
specific load on the system. The cost of a long response time can then be computed
by considering factors such as the cost of operators, telecommunication costs, and
the cost of any lost business. For example, an order processing system might specify
the number of active order takers and the average number of orders for each operational
hour. Even the query and reporting transactions against the operational system are
most likely to be predefined with predictable volume.
Even though many of the queries and reports that are run against a data
warehouse are predefined, it is nearly impossible to accurately predict the activity
against a data warehouse. The process of data exploration in a data warehouse takes
a business analyst through previously undefined paths. It is also common to have
runaway queries in a data warehouse that are triggered by unexpected results or
by users' lack of understanding of the data model. Further, many of the analysis
processes tend to be all-encompassing whereas the operational processes are well
segmented. A user may decide to explore detail data while reviewing the results
of a report from the summary tables. After finding some interesting sales activity
in a particular month, the user may join the activity for this month with the marketing
programs that were run during that particular month to further understand the sales.
Of course, there would be instances where a user attempts to run a query that will
try to build a temporary table that is a Cartesian product of two tables containing
a million rows each! While an activity like this would unacceptably degrade an operational
system's performance, it is expected and planned for in a data warehousing system.
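The scale of such a runaway query is easy to demonstrate in miniature. In this sketch (Python with SQLite, illustrative table names), two tables are joined without a join predicate, so the result is their Cartesian product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER)")
conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1000)])

# No join predicate: every row of a pairs with every row of b.
(count,) = conn.execute("SELECT COUNT(*) FROM a, b").fetchone()
print(f"{count:,} rows")   # 1,000,000 -- with million-row tables
                           # the product would reach 10**12 rows
```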
Data is mostly non-volatile
Another key attribute of the data in a data warehouse system is that
the data is brought to the warehouse after it has become mostly non-volatile. This
means that after the data is in the data warehouse, there are no modifications to
be made to this information. For example, the order status does not change, the
inventory snapshot does not change, and the marketing promotion details do not change.
This attribute of the data warehouse has many very important implications for the
kind of data that is brought to the data warehouse and the timing of the data transfer.
Let us further review what it means for the data to be non-volatile.
In an operational system the data entities go through many attribute changes. For
example, an order may go through many statuses before it is completed. Or, a product
moving through the assembly line has many processes applied to it. Generally speaking,
the data from an operational system is triggered to go to the data warehouse when
most of the activity on the business entity has been completed. This may
mean completion of an order or final assembly of an accepted product. Once an order
is completed and shipped, it is unlikely to go back to backorder status. Or, once
a product is built and accepted, it is unlikely to go back to the first assembly
station. Another important example can be the constantly changing data that is transferred
to the data warehouse one snapshot at a time. The inventory module in an operational
system may change with nearly every transaction; it is impossible to carry all of
these changes to the data warehouse. You may determine that a snapshot of inventory
carried once every week to the data warehouse is adequate for all analysis. Such
snapshot data naturally is non-volatile.
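A minimal sketch of such a snapshot load follows; the table layout is a hypothetical weekly inventory snapshot. Each load appends rows keyed by the snapshot date and never updates earlier rows, which is what keeps the warehouse data non-volatile.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inventory_snapshot (
        snapshot_date TEXT,
        product_id    TEXT,
        quantity      INTEGER,
        PRIMARY KEY (snapshot_date, product_id)
    )
""")

def load_weekly_snapshot(conn, snapshot_date, current_inventory):
    """Append one week's inventory; prior snapshots are never modified."""
    conn.executemany(
        "INSERT INTO inventory_snapshot VALUES (?, ?, ?)",
        [(snapshot_date, pid, qty)
         for pid, qty in current_inventory.items()],
    )

load_weekly_snapshot(conn, "1998-03-01", {"P1": 500, "P2": 120})
load_weekly_snapshot(conn, "1998-03-08", {"P1": 430, "P2": 140})
```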
It is important to realize that once data is brought to the data warehouse,
it should be modified only on rare occasions. It is very difficult, if not impossible,
to maintain dynamic data in the data warehouse. Many data warehousing projects have
failed miserably when they attempted to synchronize volatile data between the operational
and data warehousing systems.
Data saved for longer periods than in transaction systems
Data from most operational systems is archived after the data becomes
inactive. For example, an order may become inactive after a set period from the
fulfillment of the order; or a bank account may become inactive after it has been
closed for a period of time. The primary reason for archiving the inactive data
has been the performance of the operational system. Large amounts of inactive data
mixed with operational live data can significantly degrade the performance of a
transaction that is only processing the active data. Since the data warehouses are
designed to be the archives for the operational data, the data here is saved for
a very long period.
In fact, a data warehouse project may start without any specific plan
to archive the data off the warehouse. The cost of maintaining the data once it
is loaded in the data warehouse is minimal. Most of the significant costs are incurred
in data transfer and data scrubbing. Storing data for more than five years is very
common for data warehousing systems. There are industry examples where the success
of a data warehousing project has encouraged the managers to expand the time horizon
of the data stored in the data warehouse. They may start with storing the data for
two or three years and then expand to five or more years once the wealth of business
knowledge in the data warehouse is discovered. The falling prices of hardware have
also encouraged the expansion of successful data warehousing projects.
In short, the separation of operational data from the analysis data is
the most fundamental data warehousing concept. Not only is the data stored in a
structured manner outside the operational system, but businesses today are also allocating
considerable resources to build data warehouses at the same time that the operational
applications are deployed. Rather than archiving data to a tape as an afterthought
of implementing an operational system, data warehousing systems have become the
primary interface for operational systems. Figure 3 highlights the reasons for separation
discussed in this section.
Logical transformation of operational data
This sub-section explores the concepts associated with the data warehouse
logical model. The data is logically transformed when it is brought to the data
warehouse from the operational systems. The issues associated with the logical transformation
of data brought from the operational systems to the data warehouse may require considerable
analysis and design effort. The architecture of the data warehouse and the data
warehouse model greatly impact the success of the project. This section reviews
some of the most fundamental concepts of relational database theory that do not
fully apply to data warehousing systems. Even though most data warehouses are deployed
on relational database platforms, some basic relational principles are knowingly
modified when developing the logical and physical model of the data warehouses.
Structured extensible data model
The data warehouse model outlines the logical and physical structure
of the data warehouse. Unlike the archived data of legacy systems, the data warehouse
requires considerable modeling effort. This data modeling effort
in the early phases of the data warehousing project can yield significant benefits
in the form of an efficient data warehouse that is expandable to accommodate all
of the business data from multiple operational applications.
The data modeling process needs to structure the data in the data warehouse
independent of the relational data model that may exist in any of the operational
systems. As discussed later in this paper, the data warehouse model is likely to
be less normalized than an operational system model. Further, the operational systems
are likely to have large amounts of overlapping business reference data. Information
about current products is likely to be used in varying forms in many of the operational
systems. The data warehouse system needs to consolidate all of the reference data.
For example, the operational order processing system may maintain the pricing and
physical attributes of products whereas the manufacturing floor application may
maintain design and formula attributes for the same product. The data warehouse
reference table for products would consolidate and maintain all attributes associated
with products that are relevant for the analysis processes. Some attributes that
are essential to the operational system are likely to be deemed unnecessary for
the data warehouse and may not be loaded and maintained in the data warehouse.
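The consolidation of reference data can be sketched as follows; the two source extracts and their attribute names are assumptions made for illustration.

```python
# Hypothetical product attributes from two source applications.
order_processing = {"P1": {"list_price": 19.95, "weight_kg": 0.4}}
manufacturing = {"P1": {"design_rev": "C", "formula": "F-102"}}

def consolidate_product(product_id):
    """Merge the analysis-relevant attributes from every source
    into a single warehouse product reference row."""
    row = {"product_id": product_id}
    row.update(order_processing.get(product_id, {}))
    row.update(manufacturing.get(product_id, {}))
    return row

print(consolidate_product("P1"))
```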
The data warehouse model needs to be extensible and structured such that
the data from different applications can be added as a business case can be made
for the data. A data warehouse project in most cases cannot include data from all
possible applications right from the start. Many of the successful data warehousing
projects have taken an incremental approach to adding data from the operational
systems and aligning it with the existing data. They start with the objective of
eventually adding most if not all business data to the data warehouse. Keeping this
long-term objective in mind, they may begin with one or two operational applications
that provide the most fertile data for business analysis. Figure 4 illustrates the
extensible architecture of the data warehouse.
Data warehouse model aligns with the business structure
A data warehouse logical model aligns with the business structure rather
than the data model of any particular application. The entities defined and maintained
in the data warehouse parallel the actual business entities such as customers, products,
orders, and distributors. Different parts of an organization may have a very narrow
view of a business entity such as a customer. For example, a loan service group
in a bank may only know about a customer in the context of one or more loans outstanding.
Another group in the same bank may know about the same customer in context of a
deposit account. The data warehouse view of the customer would transcend the view
from a particular part of the business. A customer in the data warehouse would represent
a bank customer that has any kind of business with the bank.
The data warehouse would most likely build attributes of a business entity
by collecting data from multiple source applications. Consider, for example, the
demographic data associated with a bank customer. The retail operational system
may provide some attributes such as social security number, address, and phone number.
A mortgage system or some purchased database may provide employment, income,
and net worth information.
The structure of the data in any single source application is likely
to be inadequate for the data warehouse. The structure in a single application may
be influenced by many factors, including:
- Purchased Applications: The application data structure may be dictated
by an application that was purchased from a software vendor and integrated into
the business. The user of the application may have very little or no control over
the data model. Some vendor applications have a very generic data model that is
designed to accommodate a large number and variety of businesses.
- Legacy Application: The source application may be a very old, mostly
homegrown application where the data model has evolved over the years. The database
engine in this application may have been changed more than once without anyone taking
the time to fully exploit the features of the new engine. There are many legacy
applications in existence today where the data model is neither well documented
nor understood by anyone currently supporting the application.
- Platform Limitations: The source application data model may be restricted
by the limitations of the hardware/software platform or development tools and technologies.
A database platform may not support certain logical relationships, or there may be
physical limitations on the data attributes.
Figure 5 illustrates the alignment of data warehouse entities with the
business structure. The data warehouse model breaks away from the limitations of
the source application data models and builds a flexible model that parallels the
business structure. This extensible data model is easy for business analysts as
well as managers to understand.
Transformation of the operational state information
It is essential to understand the implications of not being able to maintain
the state information of the operational system when the data is moved to the data
warehouse. Many of the attributes of entities in the operational system are very
dynamic and constantly modified. Many of these dynamic operational system attributes
are not carried over to the data warehouse; others are static by the time they are
moved to the data warehouse. A data warehouse generally does not contain information
about entities that are dynamic and constantly going through state changes.
To understand what it means to lose the operational state information,
let us consider the example of an order fulfillment system that tracks the inventory
to fill orders. First let us look at the order entity in this operational system.
An order may go through many different statuses or states before it is fulfilled
or goes to the "closed" status. Other order statuses may indicate that
the order is ready to be filled, it is being filled, back ordered, ready to be shipped,
etc. This order entity may go through many states that capture the status of the
order and the business processes that have been applied to it. It is nearly impossible
to carry forward all of the attributes associated with these order states to the data
warehousing system. The data warehousing system is most likely to have just one
final snapshot of this order. Or, as the order is ready to be moved into the data
warehouse, the information may be gathered from multiple operational entities such
as order and shipping to build the final data warehouse order entity.
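A sketch of this collapse of operational state follows; the entity layouts and the "closed" terminal status are hypothetical.

```python
def build_warehouse_order(order, shipment):
    """Collapse the order's operational state history into the one
    static snapshot that the warehouse keeps."""
    if order["status"] != "closed":
        return None   # still volatile; not ready for the warehouse
    return {
        "order_id":     order["order_id"],
        "customer_id":  order["customer_id"],
        "order_date":   order["order_date"],
        "ship_date":    shipment["ship_date"],
        "carrier":      shipment["carrier"],
        "total_amount": order["total_amount"],
    }

order = {"order_id": 42, "customer_id": "C7", "order_date": "1998-02-01",
         "status": "closed", "total_amount": 250.0}
shipment = {"order_id": 42, "ship_date": "1998-02-04", "carrier": "XYZ"}
print(build_warehouse_order(order, shipment))
```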
Now let us consider the more complicated example of inventory data within
this system. The inventory may change with every single transaction. The quantity
of a product in the inventory may be reduced by an order fulfillment transaction
or this quantity may be increased with receipt of a new shipment of the product.
If this order processing system executes ten thousand transactions in a given day,
it is likely that the actual inventory in the database will go through just as many
states or snapshots during this day. It is impossible to capture this constant change
in the database and carry it forward to the data warehouse. This is still one of
the most perplexing problems with data warehousing systems. There are many approaches
to solving this problem. You will most likely choose to carry periodic snapshots
of the inventory data to the data warehouse. This scenario can apply to a very large
portion of the data in the operational systems. The issues associated with this
get much more complicated as extended time periods are considered.
Figure 6 illustrates how most of the operational state information cannot
be carried over to the data warehouse system.
De-normalization of data
Before we consider data model de-normalization in the context of data
warehousing, let us quickly review relational database concepts and the normalization
process. E. F. Codd developed relational database theory in the late 1960s while
he was a researcher at IBM. Many prominent researchers have made significant contributions
to this model since its introduction. Today, most of the popular database platforms
follow this model closely. A relational database model is a collection of two-dimensional
tables consisting of rows and columns. In relational modeling terminology, the
tables, rows, and columns are respectively called relations, tuples, and attributes.
The name of the relational database model is derived from the term relation for a table.
The model further identifies unique keys for all tables and describes the relationship
between tables.
Normalization is a relational database modeling process where the relations
or tables are progressively decomposed into smaller relations to a point where all
attributes in a relation are very tightly coupled with the primary key of the relation.
Most data modelers try to achieve the "Third Normal Form" with all of
the relations before they de-normalize for performance or other reasons. The three
levels of normalization are briefly described below:
- First Normal Form: A relation is said to be in First Normal Form
if it describes a single entity and it contains no arrays or repeating attributes.
For example, an order table or relation with multiple line items would not be in
First Normal Form because it would have repeating sets of attributes for each line
item. The relational theory would call for separate tables for order and line items,
as sketched in the example following this list.
- Second Normal Form: A relation is said to be in Second Normal Form
if in addition to the First Normal Form properties, all attributes are fully dependent
on the primary key for the relation.
- Third Normal Form: A relation is in Third Normal Form if in addition
to Second Normal Form, all non-key attributes are completely independent of each
other.
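For the First Normal Form item above, here is a minimal sketch of the order and line-item decomposition; the field names are invented for illustration.

```python
# Not in First Normal Form: the repeating line-item attributes are
# embedded in the order record itself.
order_unnormalized = {
    "order_id": 42,
    "item1_product": "P1", "item1_qty": 2,
    "item2_product": "P9", "item2_qty": 1,
}

# First Normal Form: the repeating group moves to its own relation,
# each line item keyed by (order_id, line_no).
orders = [{"order_id": 42}]
line_items = [
    {"order_id": 42, "line_no": 1, "product": "P1", "qty": 2},
    {"order_id": 42, "line_no": 2, "product": "P9", "qty": 1},
]
```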
The process of normalization generally breaks a table into many independent
tables. While a fully normalized database can yield a fantastically flexible model,
it generally makes the data model more complex and difficult to follow. Further,
a fully normalized data model can perform very inefficiently. A data modeler in
an operational system would take the normalized logical data model and convert it
into a physical data model that is significantly de-normalized. De-normalization reduces
the need for database table joins in the queries.
Some of the reasons for de-normalizing the data warehouse model are the
same as they would be for an operational system, namely, performance and simplicity.
The data normalization in relational databases provides considerable flexibility
at the cost of the performance. This performance cost is sharply increased in a
data warehousing system because the amount of data involved may be much larger.
A three-way join with relatively small tables of an operational system may be acceptable
in terms of performance cost, but the join may take an unacceptably long time with
large tables in the data warehouse system.
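The trade-off can be sketched as follows (Python with SQLite, illustrative tables): the normalized form needs a join to reach the product group, while the de-normalized warehouse form answers the same question from a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized operational form: the order row carries only the
    -- product ID, so reaching any product attribute requires a join.
    CREATE TABLE product  (product_id TEXT PRIMARY KEY, product_group TEXT);
    CREATE TABLE orders   (order_id INTEGER, product_id TEXT, amount REAL);

    -- De-normalized warehouse form: the group is stored on the row
    -- itself, so the query scans a single table with no join.
    CREATE TABLE orders_dw (order_id INTEGER, product_id TEXT,
                            product_group TEXT, amount REAL);

    INSERT INTO product   VALUES ('P1', 'A');
    INSERT INTO orders    VALUES (1, 'P1', 100.0);
    INSERT INTO orders_dw VALUES (1, 'P1', 'A', 100.0);
""")

# Same question, two shapes: sales by product group.
with_join = conn.execute("""
    SELECT p.product_group, SUM(o.amount)
    FROM orders o JOIN product p ON o.product_id = p.product_id
    GROUP BY p.product_group
""").fetchall()
without_join = conn.execute("""
    SELECT product_group, SUM(amount) FROM orders_dw GROUP BY product_group
""").fetchall()
assert with_join == without_join   # identical answers; fewer joins
```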
Static relationships in historical data
Another reason that de-normalization is an important process in data
warehousing modeling is that the relationship between many attributes does not change
in this historical data. For example, in an operational system, a product may be
part of the product group "A" this month and product group "B"
starting next month. In a properly normalized data model, it would be inappropriate
to include the product group attribute with an order entity that records an order
for this product; only the product ID would be included. The relational theory would
call for a join on the order table and product table to determine the product group
and any other attributes of this product. This relational theory concept does not
apply to a data warehousing system because in a data warehousing system you may
be capturing the group that this product belonged to when the order was filled.
Even though the product moves to different groups over time, the relationship between
the product and the group in context of this particular order is static.
Another important example can be the price of a product. The prices in
an operational system may change constantly. Some of these price changes may be
carried to the data warehouse with a periodic snapshot of the product price table.
In a data warehousing system, you would carry with each order the list price of
the product at the time the order was placed, regardless of the selling price for that order.
The list price of the product may change many times in one year and your product
price database snapshot may even manage to capture all these prices. But, it is
nearly impossible to determine the historical list price of the product at the time
each order is generated if it is not carried to the data warehouse with the order.
The relational database theory makes it easy to maintain dynamic relationships between
business entities, whereas a data warehouse system captures relationships between
business entities at a given time.
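A sketch of capturing such point-in-time attributes during the load follows; the current product table and the attribute names are hypothetical.

```python
# Hypothetical current product reference at load time.
current_product = {"P1": {"product_group": "A", "list_price": 19.95}}

def load_order_fact(order):
    """Freeze the product's group and list price as of the order
    date onto the fact row; the current product table will drift,
    but this historical relationship stays static."""
    attrs = current_product[order["product_id"]]
    return {
        **order,
        "product_group_at_sale": attrs["product_group"],
        "list_price_at_sale":    attrs["list_price"],
    }

print(load_order_fact({"order_id": 42, "product_id": "P1",
                       "selling_price": 17.5}))
```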
The logical transformation of source application data described here requires considerable
effort, but it is a very important early investment toward the development of a
successful data warehouse. Figure 7 highlights the logical
transformation concepts discussed in this section.
Physical transformation of operational data
Physical transformation of data homogenizes and purifies the data. These
data warehousing processes are typically known as "data scrubbing" or
"data staging" processes. The "data scrubbing" processes are
some of the most labor-intensive and tedious processes in a data warehousing project.
Yet, without proper scrubbing, the analytical value of even the clean data can be
greatly diminished. Physical transformation includes the use of easy-to-understand
standard business terms, and standard values for the data. A complete dictionary
associated with the data warehouse can be a very useful tool. During these physical
transformation processes the data is sometimes "staged" before it is entered
into the data warehouse. The data may be combined from multiple applications during
this "staging" step or the integrity of the data may be checked during
this process.
The concepts associated with the physical transformation of the data
are introduced in this sub-section. Historical data and current operational
application data are likely to have some missing or invalid values. It is essential
to manage missing values and incomplete transformations while moving the data to
the data warehousing system. The end user of the data warehouse
must have a way to learn about any missing data and the default values used by the
transformation processes.
Operational terms transformed into uniform business terms
The terms and names used in the operational systems are transformed into
uniform standard business terms by the data warehouse transformation processes.
The operational application may use cryptic or difficult to understand terms for
a variety of reasons. The platform software may impose length and format
restrictions on a term, or a purchased application may use a term that is too
generic for your business. The data warehouse needs to consistently use standard
business terms that are self-explanatory.
A customer identifier in the operational systems may be called cust,
cust_id, or cust_no. Further, different operational applications may use different
terms to refer to the same attribute. For example, a customer in the loan organization
in a bank may be referred to as a Borrower. You may choose a simple standard business
term such as Customer Id in the data warehouse. This term would require little or
no explanation even to the novice user of the data warehouse.
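One simple way to implement the renaming is a per-source term map applied during the load, as in this sketch; the source field names are hypothetical.

```python
# Map each source application's field names to standard business terms.
TERM_MAP = {
    "retail": {"cust": "customer_id"},
    "loans":  {"borrower_no": "customer_id"},
    "orders": {"cust_no": "customer_id"},
}

def standardize(source, record):
    """Rename source-specific fields to the warehouse's standard terms."""
    mapping = TERM_MAP[source]
    return {mapping.get(field, field): value
            for field, value in record.items()}

print(standardize("loans", {"borrower_no": "C7", "balance": 1200.0}))
# {'customer_id': 'C7', 'balance': 1200.0}
```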
Single physical definition of an attribute
Different systems may evolve to use different lengths and data types
for the same data element. One system may define the product ID as 12 or 14
numeric characters, whereas another system may accommodate product IDs of up
to 18 alphanumeric characters. The software of an operational application may support
very limited data types and it may impose severe limitations on the names. Software
of another application may support a very rich set of data types, and it may be
very flexible with the naming conventions.
As an attribute is defined physically for the data warehouse, it is essential
to use meaningful data types and lengths. Use the standard data length and data
type for each attribute everywhere it is used. A functional data dictionary can
facilitate this consistent use of physical attributes.
Consistent use of entity attribute values
All attributes in the data warehouse need to be consistent in the use
of predefined values. Different source applications invariably use different attribute
values to represent the same meaning. These different values need to be converted
into a single, most sensible value as the data is loaded into the data warehouse.
A simple example for the consistent use of entity attributes is the use
of a gender flag for an individual. One source application may use flags such as
"M" and "F" to store gender for an individual whereas another
application may use the detail "Male" and "Female" to store
gender. Other applications may use yet other values to store the same piece of information.
The data warehouse may choose to consistently use "M" and "F"
for gender for all individuals throughout the system.
A more complex example can be the case of dealing with complex data values
in the source application. Many older applications use a single data value to represent
multiple attributes. An account number, for example, may not only represent a unique
account but it may also represent the account type. All accounts starting with 1
or 2 may represent one type of account whereas all other accounts may represent
something else to the business. The data warehouse would consistently use the account
ID to only represent a unique account. The account type may be computed and saved
as a separate attribute.
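Both kinds of value standardization can be sketched as follows; the gender mappings and the leading-digit account convention are assumptions made for illustration.

```python
# Different sources encode gender differently; the warehouse settles
# on "M" and "F" throughout.
GENDER_MAP = {"M": "M", "F": "F", "Male": "M", "Female": "F"}

def transform_customer(record):
    out = dict(record)
    out["gender"] = GENDER_MAP.get(record.get("gender"))
    # Split the overloaded account number: in this hypothetical
    # convention, a leading 1 or 2 denotes one account type.
    out["account_type"] = ("TYPE_A" if record["account_no"][0] in "12"
                           else "TYPE_B")
    return out

print(transform_customer({"account_no": "2001345", "gender": "Female"}))
# {'account_no': '2001345', 'gender': 'F', 'account_type': 'TYPE_A'}
```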
Issues associated with default and missing values
The data brought into the data warehouse is sometimes incomplete or contains
values that cannot be transformed properly. It is very important for the data warehouse
transformation process to use intelligent default values for the missing or corrupt
data. It is also important to devise a mechanism for users of the data warehouse
to be aware of these default values.
Some data attributes can easily be defaulted to a reasonable value when
the original is missing or corrupt. Other values can be obtained by referencing
other current data. For example, a missing product attribute such as unit-of-measure
on an order entity can be obtained by accessing the current product database. Some
attributes cannot be filled by defaults for missing values. In fact, it may be dangerous
to attempt to assign a default for certain types of missing values. A poor default
may corrupt the data and lead to invalid analysis at a later stage. In these cases,
it is safest to leave the missing values as blank. In some cases, it may make sense
to pick a specific value or symbol that indicates a missing value.
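A minimal sketch of such a defaulting step follows, assuming a hypothetical current product reference; the substitution is logged so warehouse users can learn which values were defaulted.

```python
import logging

logging.basicConfig(level=logging.WARNING)

# Hypothetical current product reference used to fill gaps.
product_ref = {"P1": {"unit_of_measure": "EA"}}

def fill_unit_of_measure(order_row):
    """Default a missing unit-of-measure from current reference data
    and log the substitution; leave it blank if no reference exists."""
    if not order_row.get("unit_of_measure"):
        ref = product_ref.get(order_row["product_id"], {})
        order_row["unit_of_measure"] = ref.get("unit_of_measure")
        logging.warning("defaulted unit_of_measure for order %s",
                        order_row["order_id"])
    return order_row

print(fill_unit_of_measure({"order_id": 42, "product_id": "P1",
                            "unit_of_measure": None}))
```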
The timing of the start of the period for which data is loaded into the
data warehouse can be important. It is safest to load data in the data warehouse
for complete years. This would prevent any misinterpretation of analysis run on
this data. Imagine a data warehouse that started loading data from the month of
March of its first year. It is very likely that a user will run a query spanning
a whole year without realizing that the data for January and February is not stored
in the data warehouse. Also, missing data for
part of the year prevents any meaningful year-on-year analysis.
It is important to design a good system to log and identify data that
is missing from the data warehouse. When a user runs a query against the data warehouse,
it is essential to understand the population against which the query is run.
Figure 8 highlights the physical transformation concepts for data warehousing
systems. Physical transformation of source application data requires considerable
effort and it can be difficult at times, but a well-considered set of physical data
transformations can make a data warehouse more user-friendly. Further, accurate
and complete transformations help maintain the integrity of the data warehouse.
Business view summarization of data
Many queries and reports against most data warehouse systems are simple
aggregations based on predefined parameters. Another key attribute of today's data
warehouses is the predefined and automatically generated summary views.
For example, many people in an organization may need to see product sales
figures. They may have a need to summarize these sales figures for a week, a month,
or a quarter. It may not be practical to summarize the needed data every time an
analyst requires it. A data warehouse that contains summary views of the detail
data around the most common queries can sharply reduce the amount of processing
needed at the time of analysis. Summary views are typically created around business
entities such as customers, products, and channels.
The summary views also hide the complexities of the detail data. Of course,
performance gain is the most significant tangible aspect of the summary views in
the data warehouse. Most relational databases provide the ability to build views
for users that hide the underlying tables. In most SQL database engines, including
MS SQL Server, such a view exists only as a definition, and its result is produced
at the time the view is actually used. While the concept of summary views in data
warehousing systems is similar, it is important not to confuse data warehousing
summary views with the term "views" as it is used in a database system. A summary
view in a data warehouse refers to an actual table that is created and maintained
independently of when it is used by a user. The key concepts around the summary
views are introduced
in this section.
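The distinction can be sketched as follows in Python with SQLite: the summary "view" is a real table that the load process refreshes, independent of any user query; all names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_detail (sale_date TEXT, product_id TEXT, amount REAL);
    -- Unlike a database view, the warehouse summary view is an
    -- actual table, created and maintained by the load process.
    CREATE TABLE weekly_product_sales (
        week TEXT, product_id TEXT, total_amount REAL
    );
    INSERT INTO sales_detail VALUES ('1998-02-02', 'P1', 100.0);
    INSERT INTO sales_detail VALUES ('1998-02-03', 'P1', 60.0);
""")

def refresh_weekly_summary(conn):
    """Rebuild the summary table from the detail data."""
    conn.execute("DELETE FROM weekly_product_sales")
    conn.execute("""
        INSERT INTO weekly_product_sales
        SELECT strftime('%Y-W%W', sale_date), product_id, SUM(amount)
        FROM sales_detail
        GROUP BY 1, 2
    """)

refresh_weekly_summary(conn)
print(conn.execute("SELECT * FROM weekly_product_sales").fetchall())
```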
Initial analysis in summary views
Summary views often are generated not only by summarizing the detail
data but also by applying business rules to the detail data. For example, the summary
views may contain a filter that applies the exact business rules for considering
an order a sale or a filter that applies the business rules for allocating a sale
to a channel entity. The summary views can hide the complexities of the detail data
from the end user for many, if not most, analysis tasks.
The business rules that are applied in generating summary views can be
complex. These business rules may determine exactly what constitutes a sale or they
may determine how a sale is allocated to a sales or channel entity. Large organizations
often have complex rules to charge sales to different ledger accounts. Some sales
may be allocated to warranty replacement and thus not be counted as sales. Or, some
sales may be further discounted based on a master contract with the customer and
thus need to be reduced when calculating product sales for a period. Often, a data
warehouse will have more than one view based on business entities such as customers
and products. There may be multiple physical tables or the same table may contain
additional attributes that allow for easy queries.
In addition to applying the business rules while generating summary views,
the data warehousing system may perform complex database operations such as multi-table
joins. Product sales may be computed by joining the Sales, Invoice, and Product
tables. The criteria to join these tables may be complex. While individuals mining
data in the warehouse detail records need to understand all the complexities of
business rules, most users can retrieve effective summary business information without
fully understanding the detail data.
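As a rough sketch of how business rules and multi-table joins can be applied while
generating a summary view, the statement below counts only shipped, non-warranty
lines as sales and nets out contract discounts; all names and rules are
illustrative:

    -- Populate the product sales summary, applying the business rules
    -- for what constitutes a sale.
    INSERT INTO weekly_product_sales (product_id, sale_week, total_sales)
    SELECT p.product_id,
           d.sale_week,
           SUM(d.sale_amount - d.contract_discount)
    FROM   sales_detail d
           JOIN invoice_detail i ON i.invoice_id = d.invoice_id
           JOIN product        p ON p.product_id = d.product_id
    WHERE  i.status = 'SHIPPED'                    -- counts as a sale
      AND  d.order_type <> 'WARRANTY_REPLACEMENT'  -- excluded from sales
    GROUP BY p.product_id, d.sale_week;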
Significant performance gains
The single most important reason for building the summary views is the
significant performance gains they facilitate. Not only are the complexities
of the detail data interpreted for the end user; the most time-consuming data
analysis is also performed before it is needed.
Summary views allow you to run a product sales query by merely setting
up a filter based on indexed fields such as date, product codes, and other relevant
criteria. Further, this query will most likely run on a sharply smaller table as
the summary views would have reduced the data from multiple tables containing millions
of rows to tens of thousands of rows. A query against this smaller table would be
significantly faster than a query that runs against detail tables joined for the
query. In some instances a summary view table can be as large as the detail tables.
This may be caused by summarization in very small units or combining multiple summary
views into one data table. For example, you may not be able to summarize the product
sales by week. Instead daily product sales figures may be required for some queries.
Even in these large summary views, the performance is generally better because many
of the table joins are eliminated and queries can generally use the indexes.
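With such a summary table in place, a typical product sales query reduces to a
simple filter on indexed fields against the much smaller table; this sketch
carries over the illustrative names used above:

    -- Quarterly sales for selected products, read from the summary
    -- table rather than from joined detail tables.
    SELECT product_id, SUM(total_sales) AS quarter_sales
    FROM   weekly_product_sales
    WHERE  sale_week BETWEEN '1998-W01' AND '1998-W13'
      AND  product_id IN ('A100', 'A200')
    GROUP BY product_id;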
Many views into the same detail
The summary views in a data warehouse provide multiple views into the
same detail data. These views are predefined dimensions into the detail data, and
they give the analyst an efficient way to link back to the detail data when
necessary.
For example, for the sales order data, four different product sales summary
views could be generated summarizing weekly sales data. These views summarizing
by product, customer, channel, and region all include the same detail data and they
would need to be updated or regenerated as new data is brought into the data warehouse.
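A sketch of the refresh for one of these summary views, assuming a weekly load
and illustrative names, is shown below; the same pattern repeats for the product,
customer, and channel views:

    -- Replace the newly loaded week in the regional summary view.
    DELETE FROM weekly_region_sales WHERE sale_week = '1998-W14';

    INSERT INTO weekly_region_sales (region_id, sale_week, total_sales)
    SELECT region_id, sale_week, SUM(sale_amount)
    FROM   sales_detail
    WHERE  sale_week = '1998-W14'
    GROUP BY region_id, sale_week;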
Even though most of the analysis is likely to be done using the summary
views, there needs to be a simple and robust way for an analyst to drill down into
the detail data. Many business problems require review of the detail data to fully
understand a pattern or anomaly exhibited in the summarized reports or queries.
Drill down from many different summary views can lead to the same detail data. A
single anomaly in detail data may manifest itself differently in different summary
views.
Summarization and predefined analysis of data in a data warehouse system
is an important task. It is essential to maintain the integrity of the summary views
because a very large part of the data warehouse activity is against the summary
views. Figure 9 highlights the key concepts around summary views. The summary views
not only need to be designed and built; they must also be maintained as new data
comes into the data warehouse.
Definition
After considering the various attributes and concepts of data warehousing
systems, a broad definition of a data warehouse can be the following:
A data warehouse is a structured extensible environment designed for
the analysis of non-volatile data, logically and physically transformed from multiple
source applications to align with business structure, updated and maintained for
a long time period, expressed in simple business terms, and summarized for quick
analysis.
Business use of a data warehouse
No discussion of the data warehousing systems is complete without review
of the type of activity supported by a data warehouse. Some of the activity against
today's data warehouses is predefined and not much different from traditional analysis
activity. Other processes such as multi-dimensional analysis and information visualization
were not available with traditional analysis tools and methods.
There is a very interesting phenomenon that is observed with many data
warehousing projects. The users of a new data warehouse only wish to get the information
that they were able to get using the old tools and methods. They wish to replicate
their queries and reports with the data warehouse and make sure that all the numbers
match. Often there is as much apprehension about the new tools and the data warehouse
as there is excitement. It is only after using the new data warehouse for a period
of time that users start to explore and discover the new capabilities available
to them. Soon after, they begin to have significant input into the data warehouse
enhancement process, and they happily become mentors to the new users.
Tools to be used against the data warehouse
One of the objectives of the data warehouse is to make it as flexible
and as open as possible. It is not desirable to set a steep entry price in terms
of software and training for using the data warehouse. The data warehouse should
be accessible by as many end-user tools and platforms as possible. Yet, it is not
possible to make every feature of the data warehouse available from every end user
tool.
Low-end tools such as the simple query capability built into most spreadsheets
may be adequate for a user who only needs to reference the data warehouse quickly.
Other users may require the use of the most powerful multi-dimensional analysis
tools. The data warehouse administrators need to identify the tools that are supported
for access to the data warehouse and the capabilities that are available using these
different tools. There can be a progression path to higher-level tools for data
warehouse users. A user can start with a familiar low-level tool; after becoming
comfortable with the data warehouse, he or she may be able to justify the cost
and effort involved in using a more complex tool.
In most data warehousing projects, there is a need to select a preferred
data warehouse access tool for the most active users. A small number of users generate
most of the analysis activity against the data warehouse. The data warehouse performance
can be tuned to the requirements of the tool appropriate for these active users.
This tool can be used for training and demonstration of the data warehouse.
Standard reports and queries
Many users of the data warehouse need to access a set of standard reports
and queries. It is desirable to produce, automatically and on a regular schedule,
a set of standard reports that many different users require. When these users need a particular
report, they can just view the report that has already been run by the data warehouse
system rather than running it themselves. This facility can be particularly useful
for reports that take a long time to run.
Such a facility would require report server software. It is likely that
these reports can be accessed only using the client program for that system. This
facility would need to work with or be part of the preferred data warehouse access
tool previously mentioned. Many end user query and analysis tools now include server
software that can be run with the data warehouse to serve reports and query results.
These tools now provide a web interface to the reports. In many data warehouse
systems, this report and query server becomes an essential facility. The data warehouse
users and administrators constantly need to consider any reports that are candidates
to become standard reports for the data warehouse. Frequently, individual users
may develop reports that can be used by other users.
In addition to standard reports and queries, sometimes it is useful to
share some of the advanced work done by other users. A user may produce advanced
analysis that can be parameterized or otherwise adapted by other users in different
parts of the same organization, or even in other organizations.
Queries against summary tables
As introduced earlier, the summary views in the data warehouse can be
the object of a large majority of analysis in a data warehouse. Simple filtering
and summation from the summary views accounts for most of the analysis activity
against many data warehouses. These summary views contain predefined standard business
analysis.
For example, in a typical data warehouse, the product summary view may
account for a very large number of queries where different users select different
products and the time periods for product sales and profit margin queries. These
queries provide quick response and they are very simple to build. Advanced users
typically attach a pivot table in their analysis tool to data warehouse summary
tables for simple multi-dimensional analysis.
Data mining in the detail data
Even though data mining in the detail data may account for a very small
percentage of the data warehouse activity, the most useful data analysis might be
done on the detail data. The reports and queries off the summary tables are adequate
to answer many "what" questions in the business. The drill down into the
detail data provides answers to "why" and "how" questions.
Data mining is an evolving science. A data-mining user starts with summary
data and drills down into the detail data looking for arguments to prove or disprove
a hypothesis. The tools for data mining are evolving rapidly to satisfy the need
to understand the behavior of business units such as customers and products.
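As a simple sketch of the drill-down step, once a summary view exposes an anomaly
for a particular product and week, the corresponding detail rows can be retrieved
directly; the names continue the illustrative schema used earlier:

    -- Drill down from a summary anomaly to the underlying detail.
    SELECT *
    FROM   sales_detail
    WHERE  product_id = 'A100'
      AND  sale_week  = '1998-W07'
    ORDER BY sale_amount DESC;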
Interface with other data warehouses
The data warehouse system is likely to be interfaced with other applications
that use it as a source of consistent operational data. A data warehouse may feed
data to other data warehouses or to smaller data warehouses called data marts.
Interfaces built against the data warehouse often become
increasingly stable and powerful. As the data warehouse becomes a reliable source
of data that has been consistently moved from the operational systems, many downstream
applications find that a single interface with the data warehouse is much easier
and more functional than multiple interfaces with the operational applications.
The data warehouse can be a better single, consistent source for many kinds of
data than the operational systems. It is important to remember, however, that
much of the operational state information is not carried over to the data warehouse.
Thus, the data warehouse cannot be the source for all operational system interfaces.
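A data mart feed can be as simple as a periodic extract from a warehouse summary
table, as in this sketch with illustrative names:

    -- Feed a regional data mart from the warehouse: one consistent
    -- extract instead of several operational system interfaces.
    INSERT INTO mart_sales_west (sale_week, total_sales)
    SELECT sale_week, total_sales
    FROM   weekly_region_sales
    WHERE  region_id = 'WEST';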
Figure 10 illustrates the analysis processes that run against a data
warehouse. Although a majority of the activity against today's data warehouses is
simple reporting and analysis, the sophistication of analysis at the high end continues
to increase rapidly. Of course, analysis run against a data warehouse is simpler
and cheaper than analysis performed with the old methods. This simplicity continues
to be a main attraction of data warehousing systems.
Summary
This paper introduced the fundamental concepts of data warehousing. It
is important to note that data warehousing is a science that continues to evolve.
Many of the design and development concepts introduced here greatly influence the
quality of the analysis that is possible with data in the data warehouse. If invalid
or corrupt data is allowed to get into the data warehouse, the analysis done with
this data is likely to be invalid.
After the rapid acceptance of data warehousing systems during the past three
years, there will continue to be many more enhancements and adjustments to the data
warehousing system model. Further evolution of the hardware and software technology
will also continue to greatly influence the capabilities that are built into data
warehouses.
Data warehousing systems have become a key component of information technology
architecture. A flexible enterprise data warehouse strategy can yield significant
benefits for a long period.