IBM InfoSphere DataStage Data Flow and Job Design. Nagraj Alur. Celso Takahashi. Sachiko Toratani. Denis Vasconcelos. Paul Christensen. InfoSphere DataStage enables organizations to integrate disparate data and deliver trusted information wherever and whenever needed. Develop highly efficient and scalable information integration applications. Investigate, design, and develop data flow jobs.
Analytical systems often must detect trends to enable managers to make strategic decisions. For example, a product definition in a sales tracking data mart is a dimension that will likely change for many products over time but this dimension typically changes slowly.
One major transformation and movement challenge is how to enable systems to track changes that occur in these dimensions over time. In many situations, dimensions change only occasionally.
Figure 8. Looking up primary key for a dimension table

The Slowly Changing Dimension (SCD) stage processes source data for a dimension table within the context of a star schema database structure. The stage lets you overwrite the existing dimension (known as a Type 1 change), update while preserving rows (known as Type 2), or have a hybrid of both types.
To prepare data for loading, the SCD stage performs the following process for each changing dimension in the star schema:
1. Business keys from the source are used to look up a surrogate key in each dimension table. Typically the dimension row is found.
2. If a dimension row is not found, a row must be created with a surrogate key.
3. If a dimension row is found but must be updated (Type 1), the update must be done.
4. For preserving history (Type 2), a new row is added and the original row is marked.
5. A surrogate key is added to the source data and non-fact data is deleted.
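The lookup, Type 1, and Type 2 handling described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how the SCD stage is implemented: the `DimRow` and `ProductDimension` names, the single tracked attribute `description`, and the in-memory list standing in for the dimension table are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    """One row of a (hypothetical) product dimension table."""
    surrogate_key: int
    business_key: str
    description: str
    effective_date: date
    expiry_date: Optional[date] = None
    is_current: bool = True


class ProductDimension:
    """In-memory stand-in for a dimension table (illustrative only)."""

    def __init__(self) -> None:
        self.rows: List[DimRow] = []
        self._next_key = 1

    def _insert(self, business_key: str, description: str, today: date) -> DimRow:
        row = DimRow(self._next_key, business_key, description, today)
        self._next_key += 1
        self.rows.append(row)
        return row

    def lookup_current(self, business_key: str) -> Optional[DimRow]:
        # Look up the current row for a business key.
        for row in self.rows:
            if row.business_key == business_key and row.is_current:
                return row
        return None

    def apply(self, business_key: str, description: str,
              today: date, scd_type: int = 2) -> int:
        """Return the surrogate key to attach to the source (fact) row."""
        current = self.lookup_current(business_key)
        if current is None:
            # Not found: create a row with a new surrogate key.
            return self._insert(business_key, description, today).surrogate_key
        if current.description == description:
            # Found and unchanged: reuse the existing surrogate key.
            return current.surrogate_key
        if scd_type == 1:
            # Type 1: overwrite in place, history is lost.
            current.description = description
            return current.surrogate_key
        # Type 2: mark the original row as expired and add a new row.
        current.expiry_date = today
        current.is_current = False
        return self._insert(business_key, description, today).surrogate_key
```

A Type 2 change leaves two rows for the same business key, each with its own surrogate key, while a Type 1 change keeps the row count and the surrogate key unchanged.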
In a Type 2 update, a new row with a new surrogate primary key is inserted into the dimension table to capture changes. All the rows that describe a dimension contain attributes that uniquely identify the most recent instance and historical dimensions. Figure 9 shows how the new product dimension is redefined to include the data that goes into the dimension table and also contains the surrogate key, expiry date, and the currency indicator.

Figure 9. Redefining a dimension table

Finally, the new record is written into the dimension table with all surrogate keys, reflecting the change in product dimension over time. Although the product stock keeping unit has not changed, the database structure enables the user to identify sales of current versions versus earlier versions of the product.

The Dynamic Relational stage reads data from or writes data to a database.
Figure 10 shows the general information about the database stage, including the database type, name, user ID, and password that is used to connect. Passwords can be encrypted.

Although ODBC can be used to build SQL that will work for a broad range of databases, the database-specific parsers help you take advantage of database-specific functionality.

A job sequence can also contain control information.
For example, the sequence might indicate different actions depending on whether a job in the sequence succeeds or fails. After you define a job sequence, you can schedule and run the sequence by using the Director client, the command line, or an API. The sequence appears in the repository and in the Director client as a job.
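The idea of a sequence that takes different actions depending on whether a job succeeds or fails can be sketched as follows. This is a plain Python illustration, not the DataStage API: the `Activity` class, its `on_ok` and `on_failed` fields, and `run_sequence` are hypothetical names, and each job is reduced to a callable that returns `True` on success.

```python
from typing import Callable, Dict, List, Optional


class Activity:
    """A step in a sequence, with the next step chosen by its outcome."""

    def __init__(self, name: str, run: Callable[[], bool],
                 on_ok: Optional[str] = None,
                 on_failed: Optional[str] = None) -> None:
        self.name = name
        self.run = run              # returns True on success, False on failure
        self.on_ok = on_ok          # activity to run next if this one succeeds
        self.on_failed = on_failed  # activity to run next if this one fails


def run_sequence(activities: Dict[str, Activity], start: str) -> List[str]:
    """Run activities from `start`, following the outcome of each one.

    Returns the names of the activities that ran, in order.
    Assumes the control flow has no cycles.
    """
    executed: List[str] = []
    current: Optional[str] = start
    while current is not None:
        activity = activities[current]
        executed.append(activity.name)
        succeeded = activity.run()
        current = activity.on_ok if succeeded else activity.on_failed
    return executed


# Example: extract, then load; any failure routes to a notification step.
example = {
    "extract": Activity("extract", lambda: True, on_ok="load", on_failed="notify"),
    "load": Activity("load", lambda: False, on_failed="notify"),
    "notify": Activity("notify", lambda: True),
}
order = run_sequence(example, "extract")
```

Here the load step fails, so control passes to the notification step instead of ending the sequence.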
Designing a job sequence is similar to designing jobs. You create the job sequence in the InfoSphere DataStage and QualityStage Designer, and add activities rather than stages from the tool palette.
You then join activities with triggers rather than links to define control flow. Each activity has properties that can be tested in trigger expressions and passed to other activities farther down the sequence.

A new DataStage Repository Import window opens. This import creates the four parallel jobs. Inside the folder, you will see a sequence job and four parallel jobs.
Step 6 Open the sequence job. It shows the workflow of the four parallel jobs that the job sequence controls. Within that workflow, the starting point for data extraction is set to the point where DataStage last extracted rows, and the ending point is set to the last transaction that was processed for the subscription set. The sync points for the last rows that were fetched are then passed to the setRangeProcessed stage, so DataStage knows where to begin the next round of data extraction.

Step 7 Open the parallel jobs. A window opens. It contains the CCD tables.

In DataStage, you use data connection objects with related connector stages to quickly define a connection to a data source in a job design.
Step 3 A window opens with two tabs, Parameters and General. Click Open. Click the Save button.
In the Designer window, follow the steps below.
Step 3 Click Load on the connection details page. This populates the wizard fields with connection information from the data connection that you created in the previous chapter.
Step 4 Click Test connection on the same page. You can see the message "connection is successful". Click Next.
Step 5 On the Data source location page, make sure the Hostname and Database name fields are correctly populated. Then click Next.
Step 6 On the Schema page, the selection page shows the list of tables that are defined in the ASN schema. It holds the detail about the synchronization points that allow DataStage to keep track of which rows it has fetched from the CCD tables.
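The role of synchronization points, remembering how far extraction has progressed so that the next round starts where the last one ended, can be illustrated with a small Python sketch. Everything here is hypothetical: the `SyncPointTracker` class, the `sync_point` column name, and the list of dictionaries standing in for a CCD table are assumptions for the example and do not reflect the actual control table layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SyncPointTracker:
    """Remembers the highest sync point that has already been fetched."""
    last_fetched: int = 0

    def extract_range(self, latest_processed: int) -> Tuple[int, int]:
        # Start after the last fetched row, end at the last processed transaction.
        return (self.last_fetched, latest_processed)

    def set_range_processed(self, upper: int) -> None:
        # Record the upper bound so the next round begins from here.
        self.last_fetched = upper


def fetch_changes(ccd_rows: List[Dict[str, int]],
                  tracker: SyncPointTracker,
                  latest_processed: int) -> List[Dict[str, int]]:
    """Return only the rows not yet fetched, then advance the tracker."""
    low, high = tracker.extract_range(latest_processed)
    batch = [row for row in ccd_rows if low < row["sync_point"] <= high]
    tracker.set_range_processed(high)
    return batch
```

Calling `fetch_changes` twice with the same upper bound returns an empty batch the second time, because the tracker has already advanced past those rows.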
Click Import, and then in the window that opens click Open. You need to modify the stages to add connection information and link to the dataset files that DataStage populates.

The Join Stage should be used when the data volume is high. It is a good alternative to the Lookup Stage when handling huge volumes of data. Join uses the paging method for data matching.

The Merge Stage can have multiple input links and a single output link, and it supports as many reject links as update input links. It takes sorted input: it combines a sorted master data set with one or more sorted update data sets. The columns from the records in the master and update data sets are merged so that the output record contains all the columns from the master record plus any additional columns from each update record.

A master record and an update record are merged only if both of them have the same values for the merge key column(s) that you specify. Merge key columns are one or more columns that exist in both the master and update records. For a Merge Stage to work properly, the master and update data sets should contain unique records. The Merge Stage is generally used to combine data sets or files.
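The merge behaviour described above can be sketched in Python for one master and one update data set. This is an illustration of the matching logic only, not the Merge Stage itself: `merge_sorted` is a hypothetical function, both inputs are assumed to be sorted on a single key with unique key values, unmatched master records are kept in the output, and update records with no matching master are collected separately, as a reject link would do.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def merge_sorted(master: List[Record], update: List[Record],
                 key: str) -> Tuple[List[Record], List[Record]]:
    """Merge a sorted update list into a sorted master list on `key`.

    Returns (merged, rejects). Each merged record holds all master columns
    plus any additional columns from the matching update record.
    """
    merged: List[Record] = []
    rejects: List[Record] = []
    i = 0
    for m in master:
        # Update records with a smaller key have no master: reject them.
        while i < len(update) and update[i][key] < m[key]:
            rejects.append(update[i])
            i += 1
        out = dict(m)
        if i < len(update) and update[i][key] == m[key]:
            for col, val in update[i].items():
                out.setdefault(col, val)  # master columns take precedence
            i += 1
        merged.append(out)
    # Anything left in the update list is beyond the last master key.
    rejects.extend(update[i:])
    return merged, rejects


master_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
update_rows = [{"id": 2, "price": 10}, {"id": 3, "price": 7}]
merged_rows, rejected_rows = merge_sorted(master_rows, update_rows, "id")
```

Because both inputs are sorted, each list is scanned once, which is why the stage requires sorted input.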
The Sort Stage is used to sort an input data set in either ascending or descending order. It offers a variety of options, such as retaining the first or last record when removing duplicates, stable sorting, and specifying the algorithm used for sorting to improve performance.

Even though data can be sorted on a link, an explicit Sort Stage should be used instead when the volume of data to be sorted is large.

The Transformer Stage is an active Stage, which can have a single input link and multiple output links. It is a very robust Stage with a lot of built-in functionality.
The Transformer Stage always generates C++ code, which is then compiled into a parallel component, so the overhead of using a Transformer Stage is high. In any job, keep the use of Transformers to a minimum and use other Stages instead where they suffice, such as:
Copy Stage, for mapping an input link to multiple output links without any transformations.
Filter Stage, for filtering out data based on certain criteria.
Switch Stage, for mapping a single input link to multiple output links based on the value of a selector field.
It is also advisable to reduce the number of Transformers in a job by combining the logic into a single Transformer rather than having several.

The Funnel Stage combines multiple inputs into a single output stream, but its presence reduces the performance of a job. When a Funnel Stage is needed in a large job, it is better to isolate it in a job of its own: write the outputs to Datasets and funnel them in a new job. The Funnel Stage should be run in continuous mode, without hindrance.

Each extra Stage in a job leaves fewer resources available for every other Stage, which directly affects the job's performance. If possible, big jobs with a large number of Stages should be logically split into smaller units. Likewise, if a particular Stage has been identified as taking a lot of time in a job, such as a Transformer Stage with complex functionality and many Stage variables and transformations, the jobs can be designed so that this Stage is put in a separate job altogether, which leaves more resources for the Transformer Stage.
While designing jobs, take care to avoid unnecessary column propagation. Columns that are not needed in the job flow should not be propagated from one Stage to another or from one job to the next.

Sorting also needs care: try to minimise the number of sorts in a job. Design the job so that operations are combined around the same sort keys and, if possible, maintain the same hash keys.
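One way to cut down on sorting is to reuse an ordering the data already has. The Python sketch below illustrates the general idea rather than anything DataStage-specific: if rows are already sorted on one key and a later operation needs them ordered on that key plus a second one, only the rows inside each group need sorting. The function name `sort_within_groups` and the sample column names are made up for the example.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def sort_within_groups(rows: Iterable[Row], group_key: str,
                       sub_key: str) -> Iterator[Row]:
    """Order rows by (group_key, sub_key), given rows already sorted by group_key.

    Only the rows inside each group are sorted, so the existing ordering
    on group_key is reused instead of re-sorting the whole data set.
    """
    for _, group in groupby(rows, key=itemgetter(group_key)):
        yield from sorted(group, key=itemgetter(sub_key))


# Rows arrive sorted by customer; only the day order within a customer is fixed.
presorted = [
    {"cust": 1, "day": 3},
    {"cust": 1, "day": 1},
    {"cust": 2, "day": 2},
    {"cust": 2, "day": 1},
]
ordered = list(sort_within_groups(presorted, "cust", "day"))
```

The result matches a full sort on both keys, but each sort call only touches one group's rows.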
An often neglected option in the Sort Stage is "don't sort if previously sorted"; setting it to true improves Sort Stage performance a great deal. In the Transformer Stage, Preserve Sort Order can be used to maintain the sort order of the data and reduce sorting in the job.

In a Transformer, use as few Stage variables as possible: the more Stage variables, the lower the performance. An overloaded Transformer can choke the data flow and lead to bad performance or even failure of the job at some point. To minimise the load on a Transformer, avoid unnecessary function calls. For example, a varchar field holding a date value can be cast to the Date type by simply formatting the input value.