Data can only be read from a database table as fast as the database delivers it, and you cannot simply run multiple copies of a database table input step: each copy would execute the same query and duplicate the rows.
Text files are more convenient: you can copy them across servers (your cluster) and read them in simultaneously. Another really cool feature of Kettle is that it can read text files in parallel (check the "Run in parallel" option in the step configuration and specify the number of copies to start in the step's context menu). How does this work? If your file is 4 GB and you specified 4 copies of the text input step, each copy reads its own chunk of the file at the same time. To be more precise: the first copy starts reading at the beginning of the file, the second copy starts at the first line found after 1 GB, the third copy at the first line found after 2 GB, and so on.
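To illustrate the idea, here is a rough Python sketch of the offset logic described above (illustrative only, not Kettle's actual implementation): each copy seeks to its share of the file, then skips forward to the next full line before it starts reading.

```python
import os

# Sketch: n parallel readers pick their start offsets so that
# copy i begins at the first line boundary at or after i * size / n.
# (Hypothetical helper for illustration -- not Kettle code.)
def chunk_starts(path, n_copies):
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_copies):
            f.seek(i * size // n_copies)
            f.readline()             # skip the partial line at the boundary
            starts.append(f.tell())  # first full line after the boundary
    return starts
```

Copy *i* would then read from `starts[i]` up to `starts[i+1]` (or to the end of the file for the last copy), so every line is read exactly once.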
Run multiple step copies (Scale up)
First find out how many cores your server has; it doesn't make sense to assign more copies than there are cores available.
Linux: cat /proc/cpuinfo
Have a look at how many times processor is mentioned (the first processor has the id 0), or count them directly with grep -c '^processor' /proc/cpuinfo.
Make sure you test your transformations first! You might get unexpected results (see my other blog post for more details).
You will usually get better performance by specifying the same number of copies for consecutive steps where possible: this creates dedicated data pipelines, and Kettle doesn't have to do the work of redistributing the rows round-robin across the other step copies.
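For intuition, the round-robin redistribution that happens when copy counts differ looks roughly like this (a conceptual Python sketch, not Kettle internals):

```python
# Distribute rows round-robin across n receiving step copies.
# When consecutive steps run the same number of copies, each copy
# keeps its own dedicated pipeline and this redistribution is skipped.
def round_robin(rows, n_copies):
    buckets = [[] for _ in range(n_copies)]
    for i, row in enumerate(rows):
        buckets[i % n_copies].append(row)
    return buckets
```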
Adjust Sort Rows step
Make sure you set a proper limit for Sort size (rows in memory) and possibly also the Free memory Threshold (in %).
You can run the Sort Rows step in multiple copies, but make sure you add a Sorted Merge step after it. That step combines the pre-sorted streams into one sorted stream in a streaming fashion.
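Conceptually, the Sorted Merge step performs a streaming k-way merge of the already-sorted streams coming out of the Sort Rows copies, much like Python's `heapq.merge` (an analogy, not Kettle code):

```python
import heapq

# Three sort-step copies each emit an already-sorted stream of rows;
# heapq.merge combines them lazily into one sorted stream without
# buffering all rows in memory.
copy_1 = [1, 4, 7]
copy_2 = [2, 5, 8]
copy_3 = [3, 6, 9]
merged = list(heapq.merge(copy_1, copy_2, copy_3))
# merged == [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The key point is that the merge only ever compares the current head of each stream, so it stays cheap no matter how many rows flow through.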
Best practices to follow when running AI for large volumes of data, where the source data is > 100K records.
Bare Minimum Hardware:
OS: 64 bit
RAM: 8 GB
JVM: 64 bit.
It is strongly recommended not to use 32-bit systems for running large integrations (>250K CIs). Best practice is to run integrations on a separate box with 64-bit hardware and operating system as specified above.
Configuration changes on the Plugin:
1> For initial loads, where data is inserted into the CMDB for the first time, performance can be optimized by selecting these options:
- a. Insert Always (provided the data in the source is unique)
- b. Select the required cache option.
2> The default cache size is 1M; change it to 4M if the volume of data is about 2-4M records.
3> For incremental loads, disable the Insert Always option so that the modified records are updated appropriately.
4> The number of threads on the CMDBOutput plugin should not exceed 10. For smaller-sized machines, use a lower number.
5> The number of threads on the CMDBLookUp plugin should always be set to 1.
6> The Carte server should be run with the JVM options -Xmx and -Xms set to 4096m.
7> If run from Spoon, always run the transformation/job with the logging level set to Minimal. In Carte it would be good to have logging disabled (what is the default?).
8> For relationships, it is best practice to always turn off updates. Also check the link below for improving the performance of AI.
What is the best way to do this in Kettle? Sort small chunks and then merge them? Sometimes we got out-of-memory issues.
You can indeed create a multi-threaded sorter.
Just make sure that when you get the data back from the multiple sort step copies, you use a "Sorted Merge" step to keep the rows sorted.
You can go multi-threaded on 2 levels: the thread level and the machine level.
Thread level: right-click on the sort step and change the number of copies to start.
Machine level: start up multiple instances of Carte to create multiple slave servers, put them in a cluster schema (pick one master), and set that schema on the step. In versions above 3.0.1 there is also an option to specify the maximum available memory threshold; that should prevent the out-of-memory errors.
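The "sort small chunks, then merge" idea from the question can be sketched as follows (illustrative Python, assuming plainly comparable rows; Kettle's Sort Rows step does the equivalent internally when it spills sorted chunks to temp files):

```python
import heapq
import pickle
import tempfile

def external_sort(rows, rows_in_memory=100_000):
    """Sort an arbitrarily large iterable of rows by sorting small
    chunks in memory, spilling them to temp files, and then
    streaming-merging all chunks. (Hypothetical helper, not PDI code.)"""
    chunks, buf = [], []
    for row in rows:
        buf.append(row)
        if len(buf) >= rows_in_memory:    # in-memory limit reached
            buf.sort()
            f = tempfile.TemporaryFile()
            pickle.dump(buf, f)           # spill the sorted chunk to disk
            f.seek(0)
            chunks.append(f)
            buf = []
    buf.sort()                            # last (partial) chunk stays in memory
    spilled = (pickle.load(f) for f in chunks)
    return heapq.merge(buf, *spilled)     # streaming k-way merge
```

Because only one chunk is ever held in memory at a time and the final merge is streaming, memory use is bounded by `rows_in_memory` rather than by the total row count, which is exactly why a sensible sort-size limit avoids the out-of-memory errors mentioned above.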
Also play with the temp file compression option. Most of the time it's faster if you DON'T compress.
If you read from files, try to use lazy conversion in the CSV input / Fixed Input steps. That will reduce the serialization effort for the sort steps.
There is a known problem with the "Java Script" step in compatibility mode in combination with lazy converted data, but other than that it should work fine (and fast).