Saturday, October 29, 2016

Advantages of using an ETL tool over Stored Procedures in any DB

  • Easier to maintain the code – through atomicity and simpler interfaces
  • Many reusable components – a lot of code (e.g. Lookups) has already been constructed
  • More flexible than SPs – they have more functionality (e.g. Cached Lookups, FTP)
  • Can deliver better performance than SPs – use of In Memory techniques can boost performance
  • Features such as logging/audit and metadata are built in
  • Support is more broadly available – it’s easier to find out why a prebuilt, widely used component is throwing errors

Whats the Difference between DWH and a Database?

In a very generic sense, the two may appear to be similar, but there are very important differences, in architecture, technology and usage patterns.
The term "database" is very broadly defined and covers a lot of stuff, but the basic ideas are relatively simple to understand.
A database, according to the most common meaning of the word, is repository of information that is used as a backing data storage for some specific application or set of applications. Databases are usually structured, but this is not a definitive technical requirement. Databases are often used in contexts where you need to be able to access and update data online, in near real time, and with multiple concurrent accesses. Real world, professional-use databases have stricter requirements; for example, security (controlling who can access and who can change any bit of information), and also consistency, in the form of transactions, that guarantee that the information can be reliably used by their clients.
A data warehouse is a particular type of database, which focus on a very specific application: storing, filtering, retrieving and analyzing huge volumes of information. This application imposes a different set of constraints and leads to a completely different architecture and usage pattern.
  1. Inside the same company, data warehouses are potentially much bigger than the databases from where the data is derived. Databases usually store only the data that is currently in active use; older records can be purged and moved to backups, mainly for performance reasons. Data warehouses are used to store much older historical records; it's also common to use data warehouses to store additional information that is bought or captured elsewhere to complement the information that is generated and stored by the internal database system.
  2. Access patterns are different. Database records are often retrieved and updated one by one; data warehouses are nearly always acessed by reporting engines that work on entire datasets at a time to generate aggregates and other analytical information. Databases are frequently updated, sometimes only a field or record at a time; data warehouses aren't updated very frequently, and for all practical purposes, never at the field or record level; instead data is appended in large batches.
  3. Databases are usually normalized, which means that a lot of work is done to guarantee that there's a unique copy of any given bit of information, which is important for performance and consistency reasons. But it's common to store different versions of the same information on a data warehouse, using different structures to compose and access the information. In other words, data warehouses are messier and more irregular, partly by design, as they need to be able to work with so many different sources of information.
  4. Data warehouses are internally stored in a completely different format compared to traditional databases. There is much less focus on traditional indexes and joins, and relationships are expressed by other mechanisms. You just don't do a SQL query with a join, you (usually) use a custom query builder to create your reports (yes, you don't call them queries), using features that are specific for your data warehouse provider and information architecture. Some engines use structures that are designed to work completely in memory; other engines use alternative storage mechanisms like column-oriented databases.