Some general ETL tips
Consider organising it by
destination (for example, all the
code to produce the Customer
dimension lives in the same module, regardless of source).
This is sometimes known as
Subject-oriented ETL. It makes
finding stuff much easier and will
increase the maintainability of your
code.
If the SQL2000 database is a mess,
you will probably find that SSIS
data flows are a clumsy way to deal
with the data. As a rule, ETL tools
scale poorly with complexity;
something like half of all data
warehouse projects in finance
companies are done with stored
procedure code as an explicit
architectural decision - for precisely this reason. If you have
to put a large amount of code in
sprocs, consider putting all of the
code in sprocs.
For a system involving lots of complex scrubbing or transformations, a 100% sproc approach is far more maintainable as it is the only feasible way to put all of the transformations and business logic in one place. With mixed ETL/sproc systems, you have to look in multiple places to track, troubleshoot, debug or change the whole transformation.
The sweet spot of ETL tools is on systems where you have a larger number of data sources with relatively simple transformations.
Make the code testable, so you can
pick apart the components and test
in isolation. Code that can only be executed from within the middle of a complex data flow in an ETL tool is much harder to test.
Make the data extract dumb with no
business logic, and copy into a
staging area. If you have business
logic spread across the extract and
transform layers, you will have
transformations that cannot be tested
in isolation and make it hard to
track down bugs. If the transform is
running from a staging area you
reduce the hard dependency on the
source system, again enhancing
testability. This is a particular win on sproc-based architectures as it allows an almost completely homogeneous code base.
Build a generic slowly-changing
dimension handler or use one off the
shelf if available. This makes it
easier to unit test this
functionality. If this can be unit
tested, the system testing does not
have to test all of the corner cases,
merely whether the data presented to
it is correct. This is not as complex as it sounds - The last one I wrote was about 600 or 700 lines of T-SQL code. The same goes for any generic scrubbing functions.
Load incrementally if possible.
Instrument your code - have it make log entries, possibly recording diagnostics such as check totals or counts. Without this, troubleshooting is next to impossible. Also, assertion checking is a good way to think of error handling for this (does row count in a equal row count in b, is A:B relationship really 1:1).
Use synthetic keys. Using natural keys from the source systems ties your system to the data sources, and makes it difficult to add extra sources. The keys and relationships in the system should always line up - no nulls. For errors, 'not recorded', make a specific 'error' or 'not recorded' entries in the dimension table and match to them.
If you build an Operational Data Store (the subject of many a religious war) do not recycle the ODS keys in the star schemas. By all means join on ODS keys to construct dimensions, but match on a natural key. This allows you to arbitrarily drop and recreate the ODS - possibly changing its structure - without disturbing the star schemas. Having this capability is a real maintenance win, as you can change ODS structure or do a brute-force re-deployment of the ODS at any point.
Points 1-2 and 4-5 mean that you can build a system where all of the code for any given subsystem (e.g. a single dimension or fact table) lives in one and only one place in the system. This type of architecture is also better for larger numbers of data sources.
Point 3 is a counterpoint to point 2. Basically the choice between SQL and ETL tooling is a function of transformation complexity and number of source systems. The simpler the data and larger the number of data sources, the stronger the case for a tools-based approach. The more complex the data, the stronger the case for moving to an architecture based on stored procedures. Generally it's better to exclusively or almost exclusively use one or the other but not both.
Point 6 makes your system easier to test. Testing SCD's or any change based functionality is fiddly, as you have to be able to present more than one version of the source data to the system. If you move the change management functionality into infrastructure code, you can test it in isolation with test data sets. This is a win in testing, as it reduces the complexity of your system testing requirements.
Point 7 is a general performance tip that you will need to observe for large data volumes. Note that you may only need incremental loading for some parts of a system; for smaller reference tables and dimensions you may not need it.
Point 8 is germane to any headless process. If it goes tits up during the night, you want some fighting chance of seeing what went wrong the next day. If the code doesn't properly log what's going on and catch errors, you will have a much harder job troubleshooting it.
Point 9 gives the data warehouse a life of its own. You can easily add and drop source systems when the warehouse has its own keys. Warehouse keys are also necessary to implement slowly changing dimensions.
Point 10 is a maintenance and deployment win, as the ODS can be re-structured if you need to add new systems or change the cardinality of a record. It also means that a dimension can be loaded from more than one place in the ODS (think: adding manual accounting adjustments) without a dependency on the ODS keys.
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…