"The data warehouse is always a physically separate store of data transformed from the application data found in the operational environment." Data entering the data warehouse comes from the operational environment in almost every case. Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars building enterprise-wide data warehouses. Many people feel that, with competition mounting in every industry, data warehousing is the latest must-have marketing weapon: a way to keep customers by learning more about their needs.
So you may ask, full of intrigue, "What exactly is a data warehouse?" Data warehouses have been defined in many ways, making it difficult to formulate a rigorous definition. Loosely speaking, a data warehouse refers to a database that is maintained separately from an organization's operational databases. Data warehouse systems allow for the integration of a variety of application systems. They support information processing by providing a solid platform of consolidated historical data for analysis.
Data warehousing is a more formalised methodology of these techniques. For example, many sales analysis systems and executive information systems (EIS) get their data from summary files rather than operational transaction files. The method of using summary files instead of operational data is in essence what data warehousing is all about. Some data warehousing tools neglect the importance of modelling and building a data warehouse and focus only on the storage and retrieval of data. These tools might have strong analytical facilities, but lack the qualities you need to build and maintain a corporate-wide data warehouse. These tools belong on the PC rather than the host. Your corporate-wide (or division-wide) data warehouse needs to be scalable, secure, open and, above all, suitable for publication.
· SCALABLE means that your data warehouse must be able to handle both a growing volume and variety of data and a growing number of users that can access it. For this reason most companies prefer to store their corporate-wide data warehouse in a relational database rather than a multidimensional database. (You can model your data dimensionally and store it in a relational database. More about dimensional modelling techniques later.)
· SECURE means that your data warehouse administrator can centrally control who is allowed to access what data and when.
· OPEN means that the data in your data warehouse is open to a wide range of query and other front-end tools. For this reason a relational database should be your first choice for a corporate-wide data warehouse. The proprietary data storage structures that are used by some data analysis tools can be fed from this central data warehouse.
1.3 DATA WAREHOUSE - SUBJECT ORIENTED
· The alignment around subject areas affects the design and implementation of the data found in the data warehouse.
· Most prominently, the major subject areas influence the most important part of the key structure.
· The difference between application orientation and subject orientation shows up as a difference in the content of the data at the detailed level as well.
· Data warehouse data excludes data that will not be used for DSS processing, while operational application-oriented data contains data to satisfy immediate functional/processing requirements that may or may not be of use to the DSS analyst.
DATA WAREHOUSE - INTEGRATED
· There is no consistency in encoding structures, naming conventions, measurement of variables and physical attributes of data among different data sources.
· When data is moved to the warehouse, it is converted.
DATA WAREHOUSE – NON VOLATILE
· Inserts, deletes, and changes are done regularly to the operational environment on a record-by-record basis.
· Only two kinds of operations occur on the warehouse data:
o loading of data
o access of data
· Update of data does not occur in the data warehouse environment.
· At the design level, the need to be cautious of the update anomaly is no factor in the data warehouse, since update of data is not done.
· So at the physical level of design, liberties can be taken to optimize the access of data, particularly in dealing with the issues of normalization and physical denormalization.
· The underlying technology used to run the data warehouse environment need not support record-by-record update in an online mode.
DATA WAREHOUSE – TIME VARIANCE
· The time horizon for the data warehouse is significantly longer than that of operational systems.
· Operational database contains current value data. Data warehouse data is nothing more than a sophisticated series of snapshots, taken as of some moment in time.
· The key structure of operational data may or may not contain some element of time. The key structure of the data warehouse always contains some element of time.
· Data warehouse data represents data over a long time horizon - from five to ten years.
· The time horizon represented for the operational environment is much shorter - from the current values of today up to sixty to ninety days.
· Applications that must perform well and must be available for transaction processing must carry the minimum amount of data.
· Data warehouse data, once correctly recorded, cannot be updated.
· Data warehouse data is, for all practical purposes, a long series of snapshots. Operational data, being accurate as of the moment of access, can be updated as the need arises.
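As an illustration of the time element in the warehouse key structure, the following is a minimal sketch of a snapshot-style fact table; the table and column names (daily_account_balance, snapshot_date, operational_accounts) are purely illustrative assumptions, not taken from any particular system:

    -- Each row is a snapshot as of a given date; snapshot_date is part of the key,
    -- so history is preserved and rows are only loaded and read, never updated in place.
    CREATE TABLE daily_account_balance (
        snapshot_date  DATE         NOT NULL,
        account_id     INTEGER      NOT NULL,
        balance        DECIMAL(12,2),
        PRIMARY KEY (snapshot_date, account_id)
    );

    -- A new snapshot is appended each day; earlier snapshots remain untouched.
    INSERT INTO daily_account_balance (snapshot_date, account_id, balance)
    SELECT CURRENT_DATE, account_id, balance
    FROM   operational_accounts;   -- hypothetical operational source table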
1.4 NEED OF A DATA WAREHOUSE
· performance
o special data organization, access methods, and implementation methods are needed to support operations typical of OLAP
o Complex OLAP queries would degrade performance for operational transactions
o Transaction and data integrity, detection and remedy of deadlocks and recovery are not required
· Function
o missing data: Decision support requires historical data which operational DBs do not typically maintain
o data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources: operational DBs, external sources
o data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled.
1.5 DATA WAREHOUSE ARCHITECTURE
• OPERATIONAL DATA → data for the DW is supplied from mainframe operational data held in first-generation hierarchical and network databases, departmental data held in proprietary file systems, private data held on workstations and private servers, and external systems such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers
• OPERATIONAL DATA STORE → a repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse
• LOAD MANAGER → also called the front-end component, it performs all the operations associated with the extraction and loading of data into the warehouse. These operations include simple transformations of the data to prepare the data for entry into the warehouse
• WAREHOUSE MANAGER → performs all the operations associated with the management of the data in the warehouse. The operations performed by this component include analysis of data to ensure consistency, transformation and merging of source data, creation of indexes and views, generation of denormalizations and aggregations, and archiving and backing up of data
• QUERY MANAGER → also called the back-end component, it performs all the operations associated with the management of user queries. The operations performed by this component include directing queries to the appropriate tables and scheduling the execution of queries
• detailed data, lightly and highly summarized data, archive/backup data
• meta-data
• end-user access tools → can be categorized into five main groups: data reporting and query tools, application development tools, executive information system (EIS) tools, online analytical processing (OLAP) tools, and data mining tools
1.6 DATA MART
DATA MART → a subset of a data warehouse that supports the requirements of a particular department or business function
• The characteristics that differentiate data marts and data warehouses include:
• a data mart focuses on only the requirements of users associated with one department or business function
• as data marts contain less data compared with data warehouses, data marts are more easily understood and navigated
• data marts do not normally contain detailed operational data, unlike data warehouses
1.6.1 DATA MART ISSUES
· DATA MART SIZE → performance deteriorates as data marts grow in size, so the size of data marts needs to be kept down to gain improvements in performance
· DATA MART FUNCTIONALITY → the capabilities of data marts have increased with the growth in their popularity
· DATA MART PERFORMANCE → two critical components: end-user response time and data loading performance → update the database incrementally so that only the cells affected by a change are updated, and not the entire MDDB structure
· USER ACCESS TO DIFFERENT MARTS → one approach is to replicate data between different data marts or, alternatively, build virtual data marts → these are views of several physical data marts or of the corporate data warehouse, tailored to meet the requirements of specific groups of users (a minimal sketch follows this list)
· DATA MARTS AND THE INTERNET/INTRANET → products sit between a web server and the data analysis product. The Internet/intranet offers users low-cost access to data marts and the data warehouse using web browsers.
· DATA MART ADMINISTRATION → organizations cannot easily perform administration of multiple data marts, giving rise to issues such as data mart versioning, data and meta-data consistency and integrity, enterprise-wide security, and performance tuning. Data mart administration tools are commercially available
· DATA MART INSTALLATION → data marts are becoming increasingly complex to build. Vendors are offering products referred to as "data mart in a box" that provide a low-cost source of data mart tools
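As a minimal sketch of the virtual data mart idea mentioned in the list above: a data mart for one department can be exposed simply as a view over the corporate warehouse. All table and column names here (sales_fact, store_dim, product_dim) are illustrative assumptions, not taken from any particular product:

    -- Hypothetical virtual data mart for the marketing department:
    -- a view that exposes only the summarized slice of the warehouse it needs.
    CREATE VIEW marketing_sales_mart AS
    SELECT s.region,
           p.brand,
           f.time_key,
           SUM(f.sales_dollars) AS sales_dollars
    FROM   sales_fact f
           JOIN store_dim   s ON s.store_key   = f.store_key
           JOIN product_dim p ON p.product_key = f.product_key
    GROUP  BY s.region, p.brand, f.time_key;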
1.7 DATA FLOW
• Inflow- The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse.
• upflow- The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distribution of the data
• downflow- The processes associated with archiving and backing-up of data in the warehouse
• outflow- The processes associated with making the data available to the end-users
• Meta -flow- The processes associated with the management of the meta-data
1.8 THE IMPORTANCE OF META DATA
• The integration of meta-data, that is, "data about data"
• Meta-data is used for a variety of purposes and the management of it is a critical issue in achieving a fully integrated data warehouse
• The major purpose of meta-data is to show the pathway back to where the data began, so that the warehouse administrators know the history of any item in the warehouse
• The meta-data associated with data transformation and loading must describe the source data and any changes that were made to the data
• The meta-data associated with data management describes the data as it is stored in the warehouse
• The meta-data is required by the query manager to generate appropriate queries; it is also associated with the users of queries
• The major integration issue is how to synchronize the various types of meta-data used throughout the data warehouse. The challenge is to synchronize meta-data between different products from different vendors using different meta-data stores
• Two major standards for meta-data and modeling in the areas of data warehousing and component-based development-MDC(Meta Data Coalition) and OMG(Object Management Group)
• a data warehouse requires tools to support the administration and management of such a complex environment.
• for the various types of meta-data and the day-to-day operations of the data warehouse, the administration and management tools must be capable of supporting the following tasks:
• monitoring data loading from multiple sources
• data quality and integrity checks
• managing and updating meta-data
• monitoring database performance to ensure efficient query response times and resource utilization
1.9 OLAP SERVERS
· Relational DBMS as Warehouse Servers
· Two possibilities for OLAP servers
· (1) Relational OLAP (ROLAP)
- Relational and specialized relational DBMS to store and manage warehouse data
- OLAP middleware to support missing pieces
· (2) Multidimensional OLAP (MOLAP)
- Array-based storage structures
- Direct access to array data structures
An expanded view of the model shows three dimensions: Time, Store and Product. Attribute hierarchies are vertical relationships, while extended attribute characteristics are diagonal.
• In the time dimension,
• a given date is further described by its extended attributes "current flag," "sequence" and "day of the week."
• Extended attribute characteristics have no impact on granularity. The fact that February 4, 1996 is on a Sunday has no effect on the fact that we collect sales by day. In practice, though, we may wish to compare Sunday sales to other days. We can do this by constraining our query on the extended attribute characteristic "day of the week" without having to gather any additional information.
• The store dimension includes an extended attribute characteristic at a higher level: each region has a Regional Manager.
• Attribute hierarchies imply aggregation of data: stores roll up into districts; districts into regions.
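The roll-up described above can be illustrated with a small query sketch; the table and column names (sales_fact, store_dim, time_dim, day_of_week) are assumptions for illustration only:

    -- Roll stores up to district and region, and use the extended attribute
    -- "day_of_week" as a constraint without changing the grain of the fact data.
    SELECT s.region,
           s.district,
           SUM(f.sales_dollars) AS sunday_sales
    FROM   sales_fact f
           JOIN store_dim s ON s.store_key = f.store_key
           JOIN time_dim  t ON t.time_key  = f.time_key
    WHERE  t.day_of_week = 'Sunday'
    GROUP  BY s.region, s.district;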
1.10 ROLAP: Dimensional Modeling Using Relational DBMS
·Special schema design: star, snowflake
·Special indexes: bitmap, multi-table join
·Special tuning: maximize query throughput
·Proven technology (relational model, DBMS), tend to outperform specialized MDDB especially on large data sets
·Products
-IBM DB2, Oracle, Sybase IQ, RedBrick, Informix
Benefits of the star schema: easy to understand, easy to define hierarchies, reduces the number of physical joins, low maintenance, very simple metadata
Drawbacks: summary data in the fact table yields poorer performance for summary levels; huge dimension tables can be a problem
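As a minimal sketch of the classic star schema being discussed (table and column names are illustrative assumptions): a single fact table carries the measures and foreign keys, and each dimension is one denormalized table:

    -- Dimension tables (denormalized: the store row carries district and region).
    CREATE TABLE store_dim (
        store_key    INTEGER PRIMARY KEY,
        store_name   VARCHAR(50),
        district     VARCHAR(50),
        region       VARCHAR(50)
    );
    CREATE TABLE product_dim (
        product_key  INTEGER PRIMARY KEY,
        product      VARCHAR(50),
        brand        VARCHAR(50),
        manufacturer VARCHAR(50)
    );
    CREATE TABLE time_dim (
        time_key      INTEGER PRIMARY KEY,
        calendar_date DATE,
        day_of_week   VARCHAR(10)
    );
    -- Fact table: one row per store/product/day, joined to each dimension by one key.
    CREATE TABLE sales_fact (
        store_key     INTEGER REFERENCES store_dim(store_key),
        product_key   INTEGER REFERENCES product_dim(product_key),
        time_key      INTEGER REFERENCES time_dim(time_key),
        sales_dollars DECIMAL(12,2),
        PRIMARY KEY (store_key, product_key, time_key)
    );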
· The fact table may also contain partially consolidated data, such as sales dollars for a region, for a given product for a given time period.
· Confusion is possible if ALL consolidated data isn't included. For example, if data is consolidated only to the district level, a query for regional information will bring back no records and, hence, report no sales activity. For this reason, simple stars MUST contain either
- ALL of the combinations of aggregated data or
- At least views of every combination
One approach is to create a multi-part key, identifying each record by the combination of store/district/region. Using compound keys in a dimension table can cause problems:
· It requires three separate metadata definitions to define a single relationship, which adds to complexity in the design and sluggishness in performance.
· Since the fact table must carry all three keys as part of its primary key, addition or deletion of levels in the hierarchy (such as the addition of "territory" between store and district) will require physical modification of the fact table, a time-consuming process that limits flexibility
· Carrying all of the segments of the compound dimensional key in the fact table increases the size of the crucial fact table index, a real detriment to both performance and scalability.
The biggest drawback: dimension tables must carry a level indicator for every record and every query must use it. In the example below, without the level constraint, keys for all stores in the NORTH region, including aggregates for region and district will be pulled from the fact table, resulting in error.
Example:
· Level is a problem because it causes potential for error. If the query builder, human or program, forgets about it, perfectly reasonable looking WRONG answers can occur.
· One alternative: the FACT CONSTELLATION model (summary tables)
The biggest drawback of the level indicator is that it limits flexibility (we may not know all of the levels in the attribute hierarchies at first). By limiting ourselves to only certain levels, we force a physical structure that may change, resulting in higher maintenance costs and more downtime. The level concept is a useful tool for very controlled data warehouses, that is, those that either have no ad hoc users or at least only those ad hoc users who are knowledgeable about the database. In particular, when the results of queries are pre-formatted reports or extracts to smaller systems, such as data marts, the drawbacks of the level indicator are not so evident
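To make the level problem concrete, here is a hedged sketch of the constraint every query must carry when a level indicator is used. In this variant of the dimension, aggregate rows are stored alongside detail rows, and the level column (store_level) and its values are assumptions for illustration:

    -- With a level indicator, the store dimension mixes store-, district- and
    -- region-level rows, so every query must constrain on the level column or it
    -- will count aggregate rows together with detail rows.
    SELECT SUM(f.sales_dollars) AS north_store_sales
    FROM   sales_fact f
           JOIN store_dim s ON s.store_key = f.store_key
    WHERE  s.region = 'NORTH'
      AND  s.store_level = 'STORE';   -- forgetting this predicate gives a wrong answer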
The chart above is composed of all of the tables from the Classic Star, plus aggregated fact (summary) tables. For example, the Store dimension is formed of a hierarchy of store -> district -> region. The District fact table contains ONLY data aggregated by district; therefore there are no records in the table with STORE_KEY matching any record for the Store dimension at the store level. Therefore, when we scan the Store dimension table and select keys that have district = "Texas", they will only match STORE_KEY in the District fact table when the record is aggregated for stores in the Texas district. No double (or triple, etc.) counting is possible and the Level indicator is not needed. These aggregated fact tables can get very complicated, though. For example, we need a District and a Region fact table, but what level of detail will they contain about the product dimension? All of the following:
STORE/PROD DISTRICT/PROD REGION/PROD
STORE/BRAND DISTRICT/BRAND REGION/BRAND
STORE/MANUF DISTRICT/MANUF REGION/MANUF
And these are just the combinations from two dimensions!
In the Fact Constellations, aggregate tables are created separately from the detail, therefore, it is impossible to pick up, for example, Store detail when querying the District Fact Table.
Major Advantage: No need for the "Level" indicator in the dimension tables, since no aggregated data is stored with lower-level detail
Disadvantage: Dimension tables are still very large in some cases, which can slow performance; front-end must be able to detect existence of aggregate facts, which requires more extensive metadata
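A hedged sketch of the fact constellation idea described above, mirroring the text's design in which the District fact table references only the district-level keys of the Store dimension (all names are illustrative assumptions):

    -- Hypothetical district-level fact table in a fact constellation.
    -- It references only those STORE_KEY values that stand for whole districts,
    -- so no LEVEL constraint is needed and detail rows cannot be double counted.
    CREATE TABLE district_sales_fact (
        store_key     INTEGER,          -- key of a district-level row in store_dim
        product_key   INTEGER,
        time_key      INTEGER,
        sales_dollars DECIMAL(12,2),
        PRIMARY KEY (store_key, product_key, time_key)
    );

    -- District totals come straight from the aggregate fact table:
    SELECT SUM(f.sales_dollars) AS texas_sales
    FROM   district_sales_fact f
           JOIN store_dim s ON s.store_key = f.store_key
    WHERE  s.district = 'Texas';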
ANOTHER ALTERNATIVE TO LEVEL
· Fact Constellation is a good alternative to the Star, but when dimensions have very high cardinality, the sub-selects in the dimension tables can be a source of delay.
· Another drawback is that multiple SQL statements may be needed to answer a single question, for example: measure the percent to total of a district to its region. Separate queries are needed to both the district and region fact tables, then some complex "stitching" together of the results is needed.
· Once again, it is easy to see that even with its disadvantages, the Classic Star enjoys the benefit of simplicity over its alternatives.
· An alternative is to normalize the dimension tables by attribute level, with each smaller dimension table pointing to an appropriate aggregated fact table, the “Snowflake Schema” ...
SNOWFLAKE SCHEMA
· No LEVEL in dimension tables
· Dimension tables are normalized by decomposing at the attribute level
· Each dimension table has one key for each level of the dimension's hierarchy
· The lowest level key joins the dimension table to both the fact table and the lower level attribute table
How does it work? The best way is for the query to be built by understanding which summary levels exist, finding the proper snowflaked attribute tables, constraining on keys there, and then selecting from the fact table.
Notice how the Store dimension table generates subsets of records.
First, all records from the table (where level = "District" in the Star) are extracted, and only those attributes that refer to that level (District Description, for example) and the keys of the parent hierarchy (Region_ID) are included in the table. Though the tables are subsets, it is absolutely critical that column names are the same throughout the schema.
The diagram above is a partial schema - it only shows the "snowflaking" of one dimension. In fact, the product and time dimensions would be similarly decomposed as follows: Product - product -> brand -> manufacturer (color and size are extended attribute characteristics of the attribute "product," not part of the attribute hierarchy)
Time - day -> month -> quarter -> year
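As a hedged sketch of the decomposition just described (names are illustrative assumptions), the denormalized store dimension from the earlier star sketch is split into one small table per level, each carrying the key of its parent level:

    -- Snowflaked store dimension: one normalized table per hierarchy level.
    CREATE TABLE region_dim (
        region_id     INTEGER PRIMARY KEY,
        region_name   VARCHAR(50)
    );
    CREATE TABLE district_dim (
        district_id   INTEGER PRIMARY KEY,
        district_desc VARCHAR(50),
        region_id     INTEGER REFERENCES region_dim(region_id)
    );
    CREATE TABLE store_dim (
        store_key     INTEGER PRIMARY KEY,
        store_name    VARCHAR(50),
        district_id   INTEGER REFERENCES district_dim(district_id)
    );

    -- The lowest-level key (store_key) joins to the fact table; district and
    -- region attributes are reached by walking up the snowflake.
    SELECT r.region_name, SUM(f.sales_dollars) AS region_sales
    FROM   sales_fact f
           JOIN store_dim    s ON s.store_key   = f.store_key
           JOIN district_dim d ON d.district_id = s.district_id
           JOIN region_dim   r ON r.region_id   = d.region_id
    GROUP  BY r.region_name;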
· Additional Features: The original Store dimension table, completely denormalized, is kept intact, since certain queries can benefit from its all-encompassing content.
· In practice, start with a Star Schema and create the “snowflakes” with queries. This eliminates the need to create separate extracts for each table, and referential integrity is inherited from the dimension table.
· Advantage: Best performance when queries involve aggregation
· Disadvantage: Complicated maintenance and metadata, explosion in the number of tables in the database
1.11 Data Warehousing Processes
The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. However, the acronym ETL is perhaps too simplistic, because it omits some other phases in the process of creating data warehouse data from the data sources, such as data cleansing and transportation. Here, I refer to the entire process for building a data warehouse, including the five phases mentioned above, as ETL. In addition, after the data warehouse (detailed data) is created, several data warehousing processes that are relevant to implementing and using the data warehouse are needed, which include data summarization, data warehouse maintenance, data lineage tracing, query rewriting, and data mining. In this section, I give a brief description of each process in the data warehouse.
Extraction in Data Warehouse
Extraction is the operation of extracting data from a source system for future use in a data warehouse environment. This is the first step of the ETL process. After extraction, data can be transformed and loaded into the data warehouse. The extraction process does not need to involve complex algebraic database operations, such as joins and aggregate functions. Its focus is determining which data needs to be extracted and bringing that data into the data warehouse, specifically into the staging area. However, the data sources might be very complex and poorly documented, so designing and creating the extraction process is often the most time-consuming task in the ETL process, and even in the entire data warehousing process. The data normally has to be extracted not only once, but several times in a periodic manner, to supply all changed data to the data warehouse and keep it up to date. Thus, data extraction is used not only in the process of building the data warehouse, but also in the process of maintaining it. The choice of extraction method depends heavily on the data source situation and the target of the data warehouse. Normally, the data sources cannot be modified, nor can their performance or availability be adjusted by the extraction process. Very often, entire documents or tables from the data sources are extracted to the data warehouse or staging area, and the extracted data contains the whole information from the data sources. There are two kinds of logical extraction methods in data warehousing:
Full Extraction
The data is extracted completely from the data sources. As this extraction reflects all the data
currently available on the data source, there is no need to keep track of changes to the data source since the last successful extraction. The source data will be provided as-is and no additional logic information (e.g., timestamps) is necessary on the source site.
Incremental Extraction
At a specific point in time, only the data that has changed since a well-defined event back in history will be extracted. The event may be the last time of extraction or a more complex business event like the last sale day of a fiscal period. To identify this delta change there must be a possibility to identify all the changed information since this specific time event. This information can be provided either by the source data itself, or by a change table where an appropriate additional mechanism keeps track of the changes alongside the originating transactions. In most cases, using the latter method means adding extraction logic to the data source.
For independence from the data sources, many data warehouses do not use any change-capture technique as part of the extraction process and instead use full extraction logic. After a full extraction, the entire extracted data set can be compared with the previously extracted data to identify the changed data. This approach may not have a significant impact on the data source, but it clearly can place a considerable burden on the data warehouse processes, particularly if the data volumes are large. Incremental extraction, also called Change Data Capture, is an important consideration for extraction. It can make the ETL process much more efficient and, in particular, is used in the situation when I select incremental view maintenance as the maintenance approach of the data warehouse. Unfortunately, for many source systems, identifying the recently modified data may be difficult or intrusive to the operation of the data source. Change Data Capture is typically the most challenging technical issue in data extraction.
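A minimal sketch of timestamp-based incremental extraction, assuming the source table carries a last_modified column and the warehouse records the time of its last successful extraction (all table and column names here are illustrative assumptions):

    -- Pull only rows changed since the last successful extraction into staging.
    INSERT INTO stage_orders (order_id, customer_id, amount, last_modified)
    SELECT o.order_id, o.customer_id, o.amount, o.last_modified
    FROM   source_orders o
    WHERE  o.last_modified > (SELECT MAX(extracted_up_to)
                              FROM   extraction_log
                              WHERE  source_name = 'source_orders');

    -- Record the new high-water mark for the next extraction run.
    INSERT INTO extraction_log (source_name, extracted_up_to)
    SELECT 'source_orders', MAX(last_modified) FROM stage_orders;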
Data Cleaning in Data Warehouse
Extracting data from remote data sources, especially heterogeneous databases, can bring a lot of errors and inconsistent information into the warehouse, e.g., duplicate, alias, abbreviation, and synonym data. Data warehouses usually face this problem in their role as repositories for information derived from multiple sources within and across enterprises. Thus, before transporting data from the staging area to the warehouse, data cleaning is normally required. Data cleaning, also called data cleansing or scrubbing, is a process of detecting and removing errors and inconsistencies from data in order to improve the data quality.
The problems of data cleaning include single-source problems and multi-source problems. Cleaning data for a single source means cleaning 'dirty' data within each repository. This process involves formatting and standardizing the source data, such as adding a key to every source record or, according to the requirements of the warehouse, decomposing some dimension into sub-dimensions, e.g., decomposing an Address dimension into Location, Street, City, and Zip. Cleaning data for multiple sources considers several data sources during the cleaning process, and may include merging duplicated databases; for example, the Customer and Client databases are integrated into a single Customers database. In this process, both database structure and database content may be changed.
All problems relating to data cleaning can be distinguished between the schema level and the physical (instance) level. Schema-level problems are addressed at the schema level by an improved schema design (schema evolution), schema translation and schema integration. They are of course also reflected in the physical instances. Physical-level problems, on the other hand, refer to errors and inconsistencies in the actual data contents which are not visible at the schema level.
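One common single-source cleaning step is removing duplicate records and standardizing values. The following hedged sketch keeps one row per natural key in a hypothetical customer staging table (all names are assumptions for illustration):

    -- Keep one row per customer_id (the most recently loaded) when copying
    -- from the raw staging table to the cleaned staging table.
    INSERT INTO stage_customers_clean (customer_id, name, city, load_time)
    SELECT customer_id, name, city, load_time
    FROM (
        SELECT customer_id,
               TRIM(name)        AS name,   -- basic standardization
               UPPER(TRIM(city)) AS city,
               load_time,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY load_time DESC) AS rn
        FROM   stage_customers_raw
    ) t
    WHERE  t.rn = 1;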
Transportation in Data Warehouse
Transportation is the operation of moving data from one system to another. It is often one of the simpler portions of the ETL process, and can be integrated with other portions of the process, such as extraction and transformation. In a data warehouse environment, the most common requirements for transportation are moving data from a source system to a staging database or a data warehouse database, from a staging database to a data warehouse, and from a data warehouse to a data mart.
Transportation does not often appear in the data warehouse literature, since its functionality can be split across other data warehousing processes, as is done in this report.
Transforming in Data Warehouse
Data transformation is a critical step in any data warehouse development effort. The data sources of a data warehouse contain multiple data structures, while the data warehouse has a single kind of data structure. Each heterogeneous data set has to be transformed into a uniformly structured form before loading into the data warehouse. Since the data extracted from data sources is complex, with arbitrary structures and the problems described above, data transformations are often the most complex and costly part of the ETL process, and are combined with the data cleaning process. Two approaches to data transformation are often used in data warehousing:
1. Multistage Data Transformation
2. Pipelined Data Transformation
The former implements each different transformation as a separate operation and creates separate, temporary staging tables to store the incremental results of each step. It is a common strategy and makes the transformation process easier to monitor and restart. However, a disadvantage of multistage transformation is that the space and time requirements increase. In contrast, pipelined transformation is a transform-while-loading process. There are no temporary staging tables in the middle of the transformation process. It consequently increases the difficulty of monitoring and the requirement for similarity between the source and the target data.
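A hedged sketch of the two approaches, with hypothetical table names: the multistage variant materializes an intermediate staging table for each step, while the pipelined variant applies the same transformation inside a single insert-select:

    -- Multistage: each transformation step writes to its own staging table.
    INSERT INTO stage_sales_step1        -- step 1: filter out invalid rows
    SELECT * FROM stage_sales_raw WHERE amount IS NOT NULL;

    INSERT INTO stage_sales_step2        -- step 2: convert currency
    SELECT s.sale_id, s.store_code, s.amount * r.exchange_rate AS amount_usd, s.sale_date
    FROM   stage_sales_step1 s
           JOIN exchange_rates r ON r.currency = s.currency;

    -- Pipelined: the same work is done in one statement, with no intermediate table.
    INSERT INTO warehouse_sales (sale_id, store_code, amount_usd, sale_date)
    SELECT s.sale_id, s.store_code, s.amount * r.exchange_rate, s.sale_date
    FROM   stage_sales_raw s
           JOIN exchange_rates r ON r.currency = s.currency
    WHERE  s.amount IS NOT NULL;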
Loading in Data Warehouse
Data loading is the process of moving data from one system to another, in particular of making the data accessible to the data warehouse. In this sense, data loading repeats the notion of data transportation. However, data loading appears much more often in the database literature, and is always connected with data transformation in data warehousing, since before any data transformation can occur within a database, the raw data must become accessible to the database. Loading the data into the database is one approach to solving this problem. Furthermore, in practice, LOAD is an existing command in many commercial database systems and SQL dialects. SQL*Loader is a tool used to move data from flat files into an Oracle data warehouse.
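As a hedged illustration of the final load step (table names and the batch identifier are assumptions), data that has already been cleaned and transformed in staging is appended to the warehouse fact table with a plain insert-select:

    -- Append the current batch from staging into the warehouse fact table.
    INSERT INTO sales_fact (store_key, product_key, time_key, sales_dollars)
    SELECT store_key, product_key, time_key, sales_dollars
    FROM   stage_sales_clean
    WHERE  batch_id = 20240101;   -- hypothetical identifier of the current load batch

    -- Once loaded, the staging rows for this batch can be cleared.
    DELETE FROM stage_sales_clean WHERE batch_id = 20240101;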
1.12 SECURITY IN DATA WAREHOUSING
A data warehouse is an integrated repository derived from multiple source (operational and legacy) databases. The data warehouse is created by either replicating the different source data or transforming them into a new representation. This process involves reading, cleaning, aggregating and storing the data in the warehouse model. Software tools are used to access the warehouse for strategic analysis, decision-making and marketing types of applications. It can be used for inventory control of shelf stock in many department stores.
Medical and human genome researchers can create research data that can be either marketed or used by a wide range of users. The information and access privileges in data warehouse should mimic the constraints of source data. A recent trend is to create web-based data warehouses and multiple users can create components of the warehouse and keep an environment that is open to third party access and tools. Given the opportunity, users ask for lots of data in great detail. Since source data can be expensive, its privacy and security must be assured. The idea of adaptive querying can be used to limit access after some data has been offered to the user. Based on the user profile, the access to warehouse data can be restricted or modified.
In this talk, I will focus on the following ideas that can contribute towards warehouse security.
1. Replication control
Replication can be viewed in a slightly different manner than perceived in the traditional literature. For example, an old copy can be considered a replica of the current copy of the data. Slightly out-of-date data can be considered a good substitute for some users. The basic idea is that the warehouse either keeps different replicas of the same items or creates them dynamically. Legitimate users get the most consistent and complete copy of the data, while casual users get a weak replica. Such a replica may be enough to satisfy the user's needs, but does not provide information that could be used maliciously or to breach privacy. We have formally defined the equivalence of replicas, and this notion can be used to create replicas for different users. The replicas may be kept at one central site or can be distributed to proxies who may serve the users efficiently. In some cases the user may be given the weak replica, and may be given an upgraded replica if willing to pay or if the user deserves it.
2. Aggregation and Generalization
The concept of the warehouse is based on the idea of using summaries and consolidators. This implies that source data is not available in raw form, which lends itself to ideas that can be used for security. Some users can get aggregates only over a large number of records, whereas others can be given aggregates over small data instances. The granularity of aggregation can be lowered for genuine users. The generalization idea can be used to give users high-level information at first, with the lower-level details given only after the security constraints are satisfied. For example, the user may be given an approximate answer initially, based on some generalization over the domains of the database. Inheritance is another notion that allows increasing the access capability of users. Users can inherit access to related data after having access to some data item.
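A minimal sketch of the aggregation idea above, assuming a hypothetical salaries table and a hypothetical database role: casual users see only averages over groups that are large enough, never individual rows:

    -- View for casual users: only aggregates over groups of at least 10 records
    -- are exposed, so individual source rows cannot easily be inferred.
    CREATE VIEW department_salary_summary AS
    SELECT department,
           AVG(salary) AS avg_salary,
           COUNT(*)    AS employees
    FROM   employee_salaries
    GROUP  BY department
    HAVING COUNT(*) >= 10;

    GRANT SELECT ON department_salary_summary TO casual_users;   -- hypothetical role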
3. Exaggeration and Misleading
These concepts can be used to mutilate the data. A view may be available to support a particular query, but the values may be overstated in the view. For security reasons, the quality of views may depend on the user involved, and a user can be given an exaggerated view of the data. For example, instead of giving any specific sales figures, views may scale up and give only exaggerated data. In certain situations warehouse data can give some misleading information; information which may be partially incorrect or whose correctness is difficult to verify. For example, a view of a company's annual report may contain a net profit figure that includes the profit from sales of properties (not the actual sales of products).
4. Anonymity
Anonymity provides privacy for both the user and the warehouse data. A user does not know the source warehouse for his query, and the warehouse also does not know who the user is or what particular view a user is accessing (a view may be constructed from many source databases for that warehouse). Note that a user must belong to the group of registered users and, similarly, a user must also get data only from legitimate warehouses. In such cases, encryption is to be used to secure the connection between the users and the warehouse so that no outside user (a user who has not registered with the warehouse) can access the warehouse.
5. User Profile Based Security
A user profile is a representation of the preferences of an individual user. User profiles can help in authentication and in determining the levels of security for access to warehouse data. The user profile must describe how and what has to be represented pertaining to the user's information and security-level authorization needs. The growth in warehouses has made relevant information access difficult to achieve in reasonable time, because the large number of sources differ in terms of context and representation. The warehouse can use data category details in determining access control. For example, if a user would like to access an unpublished annual company report, the warehouse server may deny access to it. The other alternative is to construct a view that reflects only the projected sales and profit report. Such a construction of a view may be transparent to the user. A server can use data given in the profile to decide whether the user should be given access to associated graphical image data. The server has the option to reduce the resolution or alter the quality of images before making them available to users.
1.13 APPLICATION OF DATA WAREHOUSE
Exploiting Data for Business Decisions
The value of a decision support system depends on its ability to provide the decision-maker with relevant information that can be acted upon at an appropriate time. This means that the information needs to be:
· Applicable. The information must be current, pertinent to the field of interest and at the correct level of detail to highlight any potential issues or benefits.
· Conclusive. The information must be sufficient for the decision-maker to derive actions that will bring benefit to the organisation.
· Timely. The information must be available in a time frame that allows decisions to be effective.
Each of these requirements has implications for the characteristics of the underlying system. To be effective, a decision support system requires access to all relevant data sources, potentially at a detailed level. It must also be quick to return both ad-hoc and pre-defined results so that the decision-maker can investigate to an appropriate level of depth without affecting performance in other areas.
Decision Support through Data Warehousing
One approach to creating a decision support system is to implement a data warehouse, which integrates existing sources of data with accessible data analysis techniques. An organisation's data sources are typically departmental or functional databases that have evolved to service specific and localised requirements. Integrating such highly focussed resources for decision support at the enterprise level requires the addition of other functional capabilities:
· Fast query handling. Data sources are normally optimised for data storage and processing, not for their speed of response to queries.
· Increased data depth. Many business conclusions are based on the comparison of current data with historical data. Data sources are normally focussed on the present and so lack this depth.
· Business language support. The decision-maker will typically have a background in business or management, not in database programming. It is important that such a person can request information using words and not syntax.
A data warehouse meets these requirements by combining the data from the various and disparate sources into a central repository on which analysis can be performed. This repository is normally a relational database that provides both the capacity for extra data depth and the support for servicing queries. Analytical functions are provided by a separate component which is optimised for extracting, assembling and presenting summary information in response to word-based queries, such as “show me last week’s sales by region”.
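A word-based request such as "show me last week's sales by region" might be translated by the analytical component into a query along the following lines. This is a hedged sketch only; the table and column names are assumptions carried over from the earlier illustrative star schema:

    -- "Show me last week's sales by region"
    SELECT s.region,
           SUM(f.sales_dollars) AS sales_last_week
    FROM   sales_fact f
           JOIN store_dim s ON s.store_key = f.store_key
           JOIN time_dim  t ON t.time_key  = f.time_key
    WHERE  t.calendar_date >= CURRENT_DATE - INTERVAL '7' DAY
      AND  t.calendar_date <  CURRENT_DATE
    GROUP  BY s.region;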
The diagram illustrates a simplified data warehouse for a manufacturing company. The departmental and functional interfaces are intended to be indicative only.
The Application of Data Warehouses
The proliferation of data warehouses is highlighted by the "customer loyalty" schemes that are now run by many leading retailers and airlines. These schemes illustrate the potential of the data warehouse for "micromarketing" and profitability calculations, but there are other applications of equal value, such as:
· Stock control
· Product category management
· Basket analysis
· Fraud analysis
All of these applications offer a direct payback to the customer by facilitating the identification of areas that require attention. This payback, especially in the fields of fraud analysis and stock control, can be of high and immediate value.
1.14 CONCLUSION
What we are seeing is two-fold, depending on the retailer's strategy:
1) Most retailers build data warehouses to target specific markets and customer segments. They're trying to know their customers. It all starts with CDI -- customer data integration. By starting with CDI, the retailers can build the DW around the customer.
2) On the other side -- there are retailers who have no idea who their customers are, or feel they don't need to… the world is their customer and low prices will keep the world loyal. They use their data warehouse to control inventory and negotiate with suppliers.
The future will bring real-time data warehouse updates… with the ability to give the retailer a minute-to-minute view of what is going on in a retail location… and take action either manually or through a condition triggered by the data warehouse data…
The future belongs to those who:
1) Possess knowledge of the Customer and
2) Effectively use that knowledge…