Partitioning in Informatica

  • When running sessions , the integration service process can achieve high performance by partitioning the pipeline and performing the extract, transformation and load for each partition in parallel. This partition option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments. 
  • The Power Center Integration Services creates a default partition type at each partition point. If you have the Partitioning option, you can change the partition type. The partition type controls how the Power Center Integration Service distributes data among partitions at partition points. When you configure the partitioning information for a pipeline, you must define a partition type at each partition point in the pipeline. The partition type determines how the Power Center Integration Service redistributes data across partition points.
  • We can configure a session for dynamic partitioning to enable the integration service to set partitioning at run time. When we enable dynamic partitioning , the integration service scales the number of session partitions based on factors such as the source database partitions or the number of nodes in a grid.
  • For relational sources, the integration service creates multiple database connections to a single source and extracts a separate range of data for each connection.
  • When integration service loads relational data, it creates multiple database connections to the target and loads partitions of data concurrently. When the integration service loads data to file targets, it creates a separate file for each partition.
  • Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties we can add or edit partition points.
1.Partition points
  • By default integration service sets partition points at various transformations in the pipeline.  
  • The partition points mark thread boundaries and divide the pipeline into stages.
  • A stage is a section of a pipeline between any two partition points.
  • When we add a partition point, we increase the number of pipeline stages by one. 
2.Number of partitions
  • If we increase the number of partition points number of threads also increases.
  • We can define up to 64 partitions at any partition point in a pipeline. 
  • The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.
3.Partition types
We can configure the partition type at most transformation in the pipeline. The partition type controls how the Integration Service distributes data among partitions at partition points. The integration service can partition data using following partition types.
  • Database partitioning: The PowerCenter Integration Service queries the IBM DB2 or Oracle system for table partition information. It reads partitioned data from the corresponding nodes in the database. Use database partitioning with Oracle or IBM DB2 source instances on a multi-node table space. Use database partitioning with DB2 targets.
  • Hash partitioning: Use hash partitioning when you want the Power Center Integration Service to distribute rows to the partitions by group. For example, you  need to sort items by item ID, but you do not know how many items have a particular ID number. You can use the following types of hash partitioning.
  • Hash auto Keys: The Power Center integration service uses all grouped sorted ports as a compound partition key. We my need to use hash auto keys partitioning at rank, sorter and unsorter, aggregator transformations.
  • Hash user keys: The power Center integration service uses a a hash function to group rows of data among partitions. We define the number of ports to generate the partition  key.      
  • Key range: You specify one or more ports to form a compound partition key. The PowerCenter Integration Service passes data to each partition depending on the ranges you specify for each port. Use key range partitioning where the sources or targets in the pipeline are partitioned by key range.
  • Pass-through:The Power Center Integration Service passes all rows at one partition point to the next partition point without redistributing them. Choose pass-through partitioning where you want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.
  • Round-robin: The PowerCenter Integration Service distributes blocks of data to one or more partitions. Use round-robin partitioning so that each partition processes rows based on the number and size of the blocks.
  • We cannot create partition points at Source instances or at Sequence Generator transformations.
  • We cannot create a partition key for hash auto-keys, round-robin, or pass-through types partitioning
  • As an alternative to partitioning, you may also use native database options to increase degree of parallelism of query processing. 

Powered by Blogger.