undefined | unit 3 introduction to databases and data warehouse

ITBM2

Unit 3Introduction to Data base and Data warehouse Q1) What is the meaning of DBMS? Also write its need.A1) DBMS stands for Database Management System. We can break it like this DBMS = Database + Management System. Database is a collection of data and Management System is a set of programs to store and retrieve those data. Based on this we can define DBMS like DBMS is a collection of inter-related data and set of programs to store & access those data in an easy and effective manner.The main aim of a DBMS is to supply a way to store up and retrieve database information that is both convenient and efficient. By data, we mean known facts that can be recorded and that have embedded meaning.A database management system (DBMS) refers to the technology for creating and managing databases. DBMS is a software tool to organize (create, retrieve, update, and manage) data in a database.Knowledge refers to the useful use of information. As you know, that information can be transported, stored, and shared without any problems and difficulties, but the same cannot be said about knowledge. Knowledge necessarily involves personal experience and practice.Database systems are meant to handle an extensive collection of information. Management of data involves both defining structures for storage of information and providing mechanisms that can do the manipulation of those stored information. Moreover, the database system must ensure the safety of the information stored, despite system crashes or attempts at unauthorized access.

To develop software applications in less time.

Data independence and efficient use of data.

For uniform data administration.

For data integrity and security.

For concurrent access to data, and data recovery from crashes.

To use user-friendly declarative query language.

Need for using DBMS: Database systems are basically developed for large amount of data. When dealing with huge amount of data, there are two things that require optimization: Storage of data and retrieval of data.Storage: According to the principles of database systems, the data is stored in such a way that it acquires lot less space as the redundant data (duplicate data) has been removed before storage. Let’s take a layman example to understand this:
In a banking system, suppose a customer is having two accounts, one is saving account and another is salary account. Let’s say bank stores saving account data at one place (these places are called tables we will learn them later) and salary account data at another place, in that case if the customer information such as customer name, address etc. are stored at both places then this is just a wastage of storage (redundancy/ duplication of data), to organize the data in a better way the information should be stored at one place and both the accounts should be linked to that information somehow. The same thing we achieve in DBMS.Fast Retrieval of data: Along with storing the data in an optimized and systematic manner, it is also important that we retrieve the data quickly when needed. Database systems ensure that the data is retrieved as quickly as possible.The main purpose of database systems is to manage the data. Consider a university that keeps the data of students, teachers, courses, books etc. To manage this data we need to store this data somewhere where we can add new data, delete unused data, update outdated data, retrieve data, and to perform these operations on data we need a Database management system that allows us to store the data in such a way so that all these operations can be performed on the data efficiently. Q2) Write the Concepts of integrity constraints in DBMS and its type.A2) Constraints enforce limits to the data or type of data that can be inserted/updated/deleted from a table. The whole purpose of constraints is to maintain the data integrity during an update/delete/insert into a table. In this tutorial we will learn several types of constraints that can be created in DBMS.

Integrity constraints are a set of rules. It is used to maintain the quality of information.

Integrity constraints ensure that the data insertion, updating, and other processes have to be performed in such a way that data integrity is not affected.

Thus, integrity constraint is used to guard against accidental damage to the database.

Types of constraints

NOT NULL

UNIQUE

DEFAULT

CHECK

Key Constraints – PRIMARY KEY, FOREIGN KEY

Domain constraints

Mapping constraints

NOT NULL:NOT NULL constraint makes sure that a column does not hold NULL value. When we don’t provide value for a particular column while inserting a record into a table, it takes NULL value by default. By specifying NULL constraint, we can be sure that a particular column(s) cannot have NULL values.UNIQUE:UNIQUE Constraint enforces a column or set of columns to have unique values. If a column has a unique constraint, it means that particular column cannot have duplicate values in a table.DEFAULT:The DEFAULT constraint provides a default value to a column when there is no value provided while inserting a record into a table.CHECK:This constraint is used for specifying range of values for a particular column of a table. When this constraint is being set on a column, it ensures that the specified column must have the value falling in the specified range.Types of Integrity Constraint1. Domain constraints

Domain constraints can be defined as the definition of a valid set of values for an attribute.

The data type of domain includes string, character, integer, time, date, currency, etc. The value of the attribute must be available in the corresponding domain.

2. Entity integrity constraints

The entity integrity constraint states that primary key value can't be null.

This is because the primary key value is used to identify individual rows in relation and if the primary key has a null value, then we can't identify those rows.

A table can contain a null value other than the primary key field.
3. Referential Integrity Constraints

A referential integrity constraint is specified between two tables.

4. Key constraints

Keys are the entity set that is used to identify an entity within its entity set uniquely.

An entity set can have multiple keys, but out of which one key will be the primary key. A primary key can contain a unique and null value in the relational table.

Q3) Write a note on Data warehousingA3) A Data Warehousing (DW) is process for collecting and managing data from varied sources to provide meaningful business insights. A Data warehouse is typically used to connect and analyze business data from heterogeneous sources. The data warehouse is the core of the BI system which is built for data analysis and reporting.It is a blend of technologies and components which aids the strategic use of data. It is electronic storage of a large amount of information by a business which is designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.The decision support database (Data Warehouse) is maintained separately from the organization's operational database. However, the data warehouse is not a product but an environment. It is an architectural construct of an information system which provides users with current and historical decision support information which is difficult to access or present in the traditional operational data store.A 3NF-designed database for an inventory system many have tables related to each other. For example, a report on current inventory information can include more than 12 joined conditions. This can quickly slow down the response time of the query and report. A data warehouse provides a new design which can help to reduce the response time and helps to enhance the performance of queries for reports and analytics.Data warehouse system is also known by the following name:

Decision Support System (DSS)

Executive Information System

Management Information System

Business Intelligence Solution

Analytic Application

Data Warehouse

A Data Warehouse works as a central repository where information arrives from one or more data sources. Data flows into a data warehouse from the transactional system and other relational databases.Data may be:

Structured

Semi-structured

Unstructured data

The data is processed, transformed, and ingested so that users can access the processed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse merges information coming from different sources into one comprehensive database.By merging all of this information in one place, an organization can analyze its customers more holistically. This helps to ensure that it has considered all the information available. Data warehousing makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales and profits. Q4) What is the Importance of data warehouse for an organization?A4) A data warehouse is a database of an organization’s multiple databases. The elaborate further, a data warehouse is a consolidation of data based on different subject matters over a period of time pulled together from different internal and external sources and departments within an organization for the purpose of analysis, discovery, reporting, and, finally, decision-making. The role and importance of data warehousing has increased rapidly with technological advancements, and as more and more enterprises focus on operating on the basis of insights derived from data-driven business strategies that favor an integrated, holistic approach to data. In some ways, a data warehouse enables decision-makers to arrive at a single version of truth that informs the business objectives and every activity undertaken by every department in service of those objectives.As we careen towards an intensely competitive world with sky-high expectations of customers from their products and services, thanks to massive leaps in the field of technological innovations, businesses are facing heat to not just make the right and lucrative financial decisions, but to make them in record time. This can be made possible only when the right kind of data warehouse is implemented. While every organization may have different goals and needs from their warehouse, generally speaking, data warehouses provide the following benefits to enterprises.Speeds up response timeSince data warehouses pull different data sets from different sources on a variety of subjects and then standardize it into a single set that can then be distributed across departments, decision-makers are able to take strategic, fact-based decisions that are compatible with each other at a much faster pace, instead of each department taking a soloed approach to data and sometimes working at cross purposes. Another major way in which data helps businesses is by allowing them to regroup and redesign business processes and operations, if after analysis they find that their cross-purposes with the business strategy, which can often be the case when departments are using different, unrelated, or outdated data sets to base their decisions on.Faster and more flexible reportingWhile on the one hand data warehousing makes it possible to collect, consolidate and create reports in real time, on the other, it creates a repository of historical information on all the variables that affect profitability. This faster and flexible data reporting mechanism makes it possible for organization’s to routinely undertake comparative and competitive analysis to gauge the efficacy of the services being provided by them, level of customer satisfaction, slow-moving products or non-performant assets, areas for improvement, find errors before they can cost the company a major loss in terms of money or reputation, identify market trends, and introduce product lines that are in line with emerging customer needs and far ahead of the competition. All these in turn improve the bottom line of the enterprise.Reduced costsWhen the process of collection, analysis and consolidation of data is simplified, and the subsequent data set is made easily accessible to all, it can result in significant savings for the company due to reduced operational and IT costs. Automating the process with the help of data warehousing tools reduces the organization’s dependency on staff dedicated to reporting and maintaining the IT infrastructure, especially when the data warehouse is cloud-based, not on-premise. Another way it reduces costs is by helping the management to make the decision to drop non-performing products without wasting too much time. Q5) Characteristics of Data warehouseA5) Data warehouse can be controlled when the user has a shared way of explaining the trends that are introduced as specific subject.

Subject-oriented –
A data warehouse is always a subject oriented as it delivers information about a theme instead of organization’s current operations. It can be achieved on specific theme. That means the data warehousing process is proposed to handle with a specific theme which is more defined. These themes can be sales, distributions, marketing etc.
A data warehouse never put emphasis only current operations. Instead, it focuses on demonstrating and analysis of data to make various decision. It also delivers an easy and precise demonstration around particular theme by eliminating data which is not required to make the decisions.

Integrated –
It is somewhere same as subject orientation which is made in a reliable format. Integration means founding a shared entity to scale the all similar data from the different databases. The data also required to be resided into various data warehouse in shared and generally granted manner.A data warehouse is built by integrating data from various sources of data such that a mainframe and a relational database. In addition, it must have reliable naming conventions, format and codes. Integration of data warehouse benefits in effective analysis of data. Reliability in naming conventions, column scaling, encoding structure etc. should be confirmed. Integration of data warehouse handles various subject related warehouse.

Time-Variant –
In this data is maintained via different intervals of time such as weekly, monthly, or annually etc. It founds various time limit which are structured between the large datasets and are held in online transaction process (OLTP). The time limits for data warehouse is wide-ranged than that of operational systems. The data resided in data warehouse is predictable with a specific interval of time and delivers information from the historical perspective. It comprises elements of time explicitly or implicitly. Another feature of time-variance is that once data is stored in the data warehouse then it cannot be modified, alter, or updated.

Non-Volatile –
As the name defines the data resided in data warehouse is permanent. It also means that data is not erased or deleted when new data is inserted. It includes the mammoth quantity of data that is inserted into modification between the selected quantities on logical business. It evaluates the analysis within the technologies of warehouse.
In this, data is read-only and refreshed at particular intervals. This is beneficial in analyzing historical data and in comprehension the functionality. It does not need transaction process, recapture and concurrency control mechanism. Functionalities such as delete, update, and insert that are done in an operational application are lost in data warehouse environment. Two types of data operations done in the data warehouse are:

Data Loading
Data Access

Q6) Elaborate the Data Warehouse ArchitectureA6) Data Warehouse Architecture is complex as it’s an information system that contains historical and commutative data from multiple sources. There are 3 approaches for constructing Data Warehouse layers: Single Tier, Two tier and three tier. This 3 tier architecture of Data Warehouse is explained as below.Single-tier architectureThe objective of a single layer is to minimize the amount of data stored. This goal is to remove data redundancy. This architecture is not frequently used in practice.Two-tier architectureTwo-layer architecture is one of the Data Warehouse layers which separates physically available sources and data warehouse. This architecture is not expandable and also not supporting a large number of end-users. It also has connectivity problems because of network limitations.Three-Tier Data Warehouse ArchitectureThis is the most widely used Architecture of Data Warehouse.It consists of the Top, Middle and Bottom Tier.

Bottom Tier: The database of the Data warehouse servers as the bottom tier. It is usually a relational database system. Data is cleansed, transformed, and loaded into this layer using back-end tools.

Middle Tier: The middle tier in Data warehouse is an OLAP server which is implemented using either ROLAP or MOLAP model. For a user, this application tier presents an abstracted view of the database. This layer also acts as a mediator between the end-user and the database.

Top-Tier: The top tier is a front-end client layer. Top tier is the tools and API that you connect and get data out from the data warehouse. It could be Query tools, reporting tools, managed query tools, Analysis tools and Data mining tools.

Q7) What is the Business use of data warehouse?A7) Organizations have a common goal – to make better business decisions. A data warehouse, once implemented into your business intelligence framework, can benefit your company in numerous ways. A data warehouse:1. Delivers enhanced business intelligenceBy having access to information from various sources from a single platform, decision makers will no longer need to rely on limited data or their instinct. Additionally, data warehouses can effortlessly be applied to a business’s processes, for instance, market segmentation, sales, risk, inventory, and financial management.2. Saves timesA data warehouse standardizes, preserves, and stores data from distinct sources, aiding the consolidation and integration of all the data. Since critical data is available to all users, it allows them to make informed decisions on key aspects. In addition, executives can query the data themselves with little to no IT support, saving more time and money.3. Enhances data quality and consistencyA data warehouse converts data from multiple sources into a consistent format. Since the data from across the organization is standardized, each department will produce results that are consistent. This will lead to more accurate data, which will become the basis for solid decisions.4. Generates a high Return on Investment (ROI)Companies experience higher revenues and cost savings than those that haven’t invested in a data warehouse.5. Provides competitive advantageData warehouses help get a holistic view of their current standing and evaluate opportunities and risks, thus providing companies with a competitive advantage.6. Improves the decision-making processData warehousing provides better insights to decision makers by maintaining a cohesive database of current and historical data. By transforming data into purposeful information, decision makers can perform more functional, precise, and reliable analysis and create more useful reports with ease.7. Enables organizations to forecast with confidenceData professionals can analyze business data to make market forecasts, identify potential KPIs, and gauge predicated results, allowing key personnel to plan accordingly.8. Streamlines the flow of informationData warehousing facilitates the flow of information through a network connecting all related or non-related parties.In business intelligence, data warehouses serve as the backbone of data storage. Business intelligence relies on complex queries and comparing multiple sets of data to inform everything from everyday decisions to organization-wide shifts in focus.To facilitate this, business intelligence is comprised of three overarching activities: data wrangling, data storage, and data analysis. Data wrangling is usually facilitated by extract, transform, load (ETL) technologies, which we’ll explain in detail below, and data analysis is done using business intelligence tools.The glue holding this process together is data warehouses, which serve as the facilitator of data storage using OLAP. They integrate, summarize, and transform data, making it easier to analyze.Even though data warehouses serve as the backbone of data storage, they’re not the only technology involved in data storage. Many companies go through a data storage hierarchy before reaching the point where they absolutely need a data warehouse. Q8) Write the Business Applications of Data warehousing.A8) The Business Applications of Data warehousing are as follows: Banking IndustryIn the banking industry, concentration is given to risk management and policy reversal as well analyzing consumer data, market trends, government regulations and reports, and more importantly financial decision making.Most banks also use warehouses to manage the resources available on deck in an effective manner. Certain banking sectors utilize them for market research, performance analysis of each product, interchange and exchange rates, and to develop marketing programs.Analysis of card holder’s transactions, spending patterns and merchant classification, all of which provide the bank with an opportunity to introduce special offers and lucrative deals based on cardholder activity. Apart from all these, there is also scope for co-branding.Finance IndustrySimilar to the applications seen in banking, mainly revolve around evaluation and trends of customer expenses which aids in maximizing the profits earned by their clients.Consumer Goods IndustryThey are used for prediction of consumer trends, inventory management, market and advertising research. In-depth analysis of sales and production is also carried out. Apart from these, information is exchanged business partners and clientele.Government and EducationThe federal government utilizes the warehouses for research in compliance, whereas the state government uses it for services related to human resources like recruitment, and accounting like payroll management.The government uses data warehouses to maintain and analyze tax records, health policy records and their respective providers, and also their entire criminal law database is connected to the state’s data warehouse. Criminal activity is predicted from the patterns and trends, results of the analysis of historical data associated with past criminals.Universities use warehouses for extracting of information used for the proposal of research grants, understanding their student demographics, and human resource management. The entire financial department of most universities depends on data warehouses, inclusive of the Financial Aid department. HealthcareOne of the most important sector which utilizes data warehouses is the Healthcare sector. All of their financial, clinical, and employee records are fed to warehouses as it helps them to strategize and predict outcomes, track and analyze their service feedback, generate patient reports, share data with tie-in insurance companies, medical aid services, etc.Hospitality IndustryA major proportion of this industry is dominated by hotel and restaurant services, car rental services, and holiday home services. They utilize warehouse services to design and evaluate their advertising and promotion campaigns where they target customers based on their feedback and travel patterns.InsuranceAs the saying goes in the insurance services sector, “Insurance can never be bought, it can be only be sold”, the warehouses are primarily used to analyze data patterns and customer trends, apart from maintaining records of already existing participants. The design of tailor-made customer offers and promotions is also possible through warehouses.Manufacturing and Distribution IndustryThis industry is one of the most important sources of income for any state. A manufacturing organization has to take several make-or-buy decisions which can influence the future of the sector, which is why they utilize high-end OLAP tools as a part of data warehouses to predict market changes, analyze current business trends, detect warning conditions, view marketing developments, and ultimately take better decisions.They also use them for product shipment records, records of product portfolios, identify profitable product lines, analyze previous data and customer feedback to evaluate the weaker product lines and eliminate them.For the distributions, the supply chain management of products operates through data warehouses.The RetailersRetailers serve as middlemen between producers and consumers. It is important for them to maintain records of both the parties to ensure their existence in the market.They use warehouses to track items, their advertising promotions, and the consumers buying trends. They also analyze sales to determine fast selling and slow selling product lines and determine their shelf space through a process of elimination.Services SectorData warehouses find themselves to be of use in the service sector for maintenance of financial records, revenue patterns, customer profiling, resource management, and human resources.Telephone IndustryThe telephone industry operates over both offline and online data burdening them with a lot of historical data which has to be consolidated and integrated.Apart from those operations, analysis of fixed assets, analysis of customer’s calling patterns for sales representatives to push advertising campaigns, and tracking of customer queries, all require the facilities of a data warehouse.Transportation IndustryIn the transportation industry, data warehouses record customer data enabling traders to experiment with target marketing where the marketing campaigns are designed by keeping customer requirements in mind.The internal environment of the industry uses them to analyze customer feedback, performance, manage crews on board as well as analyze customer financial reports for pricing strategies. Q9) Explain the Data Mining Applications.A9) The Data Mining Applications are as follows:Financial Data AnalysisThe financial data in banking and financial industry is generally reliable and of high quality which facilitates systematic data analysis and data mining. Some of the typical cases are as follows −

Design and construction of data warehouses for multidimensional data analysis and data mining.

Loan payment prediction and customer credit policy analysis.

Classification and clustering of customers for targeted marketing.

Detection of money laundering and other financial crimes.

Retail IndustryData Mining has its great application in Retail Industry because it collects large amount of data from on sales, customer purchasing history, goods transportation, consumption and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability and popularity of the web.Data mining in retail industry helps in identifying customer buying patterns and trends that lead to improved quality of customer service and good customer retention and satisfaction. Here is the list of examples of data mining in the retail industry −

Design and Construction of data warehouses based on the benefits of data mining.

Multidimensional analysis of sales, customers, products, time and region.

Analysis of effectiveness of sales campaigns.

Customer Retention.

Product recommendation and cross-referencing of items.

Telecommunication IndustryToday the telecommunication industry is one of the most emerging industries providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is the reason why data mining is become very important to help and understand the business.Data mining in telecommunication industry helps in identifying the telecommunication patterns, catch fraudulent activities, make better use of resource, and improve quality of service. Here is the list of examples for which data mining improves telecommunication services

Multidimensional Analysis of Telecommunication data.

Fraudulent pattern analysis.

Identification of unusual patterns.

Multidimensional association and sequential patterns analysis.

Mobile Telecommunication services.

Use of visualization tools in telecommunication data analysis.

Biological Data AnalysisIn recent times, we have seen a tremendous growth in the field of biology such as genomics, proteomics, functional Genomics and biomedical research. Biological data mining is a very important part of Bioinformatics. Following are the aspects in which data mining contributes for biological data analysis −

Semantic integration of heterogeneous, distributed genomic and proteomic databases.

Alignment, indexing, similarity search and comparative analysis multiple nucleotide sequences.

Discovery of structural patterns and analysis of genetic networks and protein pathways.

Association and path analysis.

Visualization tools in genetic data analysis.

Other Scientific ApplicationsThe applications discussed above tend to handle relatively small and homogeneous data sets for which the statistical techniques are appropriate. Huge amount of data have been collected from scientific domains such as geosciences, astronomy, etc. A large amount of data sets is being generated because of the fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc. Following are the applications of data mining in the field of Scientific Applications −

Data Warehouses and data preprocessing.

Graph-based mining.

Visualization and domain specific knowledge.

Intrusion DetectionIntrusion refers to any kind of action that threatens integrity, confidentiality, or the availability of network resources. In this world of connectivity, security has become the major issue. With increased usage of internet and availability of the tools and tricks for intruding and attacking network prompted intrusion detection to become a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection −

Development of data mining algorithm for intrusion detection.

Association and correlation analysis, aggregation to help select and build discriminating attributes.

Analysis of Stream data.

Distributed data mining.

Visualization and query tools.

Q10) What is the scope and the techniques used in Data Mining?A10) Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:

Automated prediction of trends and behaviors: Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data — quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.

Automated discovery of previously unknown patterns: Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.

Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products developed.Databases can be larger in both depth and breadth:

More columns: Analysts must often limit the number of variables they examine when doing hands-on analysis due to time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High performance data mining allows users to explore the full depth of a database, without preselecting a subset of variables.

More rows: Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.

The most used techniques in data mining are:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).

Genetic algorithms: Optimization techniques that use process such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ³ 1). Sometimes called the k-nearest neighbor technique.

Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Detection of anomalies: Identifying unusual values in a dataset.

Dependency modeling: Discovering existing relationships within a dataset. This frequently involves regression analysis.

Clustering: Identifying structures (clusters) in unstructured data.

Classification: Generalizing the known structure and applying it to the data.

Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms.

Sign Up