ETL - Using the wrong tool for the job

Tom Kyte

Thanks for the question, Dave.

Asked: April 01, 2006 - 4:22 pm UTC

Last updated: September 10, 2013 - 9:40 am UTC

Version: 9.2.?

Viewed 10K+ times

You Asked

First, thanks so much for all the work you do on this site. I've learned so much and it's a real blessing. I've also really enjoyed your Expert Oracle Architecture book.

My company has chosen Informatica as its ETL tool. We're using Oracle as our data warehouse platform. But I think we're going about using Informatica in the wrong way in many instances. The way I look at a tool like Informatica is that it should be used to go get data from sources that you can't get to with Oracle. A good example would be a datacom table on a mainframe 3090. To my knowledge (correct me if I'm wrong here), I don't think you can get to a datacom table with out of the box Oracle tools.

But the approach my company is taking is to push for *all* ETL to be put into this tool. In other words, even if the data *is already* in an Oracle database (via external table, etc.), they want the ETL to be done in Informatica and are against using PL/SQL. An example would be an invoice file that comes from the 3090, which we hook up as an external table, and then run an Informatica ETL to load it into the "real" Oracle table.

Management's argument is that they don't want "business and data transformation rules" buried inside PL/SQL code. My argument is that you should keep the business rules as close to the data as possible and do it as efficiently as possible, which I believe is the philosophy you subscribe to. If the data's already in Oracle, why take it out?

Just wondered if you had any advice or arguments that I could present to them...

and Tom said...

We can get to Datacom DB data - we have gateways to many existing databases and use them all of the time. But that aside...


Management is mis-speaking. It is not that "business and data transformation rules" would be 'buried' in plsql - it is that, since they are using Informatica, anything written in plsql would be "outside the realm of the meta-data Informatica keeps".

To me - the primary reason to use an ETL tool is "documentation and maintenance". You will likely find the tool getting in the way in many cases (times you just want to throw your hands up and say 'let's just code it already'). But the tool isn't there necessarily to make this aspect easier - it is there to provide continuity, documentation, change management, and the ability to figure out where the data comes from and where it goes. This is especially important for the future generations of developers that will follow you (and want - no, need - that meta data).

Probably not what you expected me to answer :) We have our own ETL tool - we call it OWB (Oracle Warehouse Builder). It happens to generate plsql and Oracle-specific SQL (and does things in SETS, which is great). But it has the same sort of meta data approach. We could argue which is better - Informatica vs OWB - but at the end of the day, you are using them not necessarily because they'll make the entire ETL process easier or even faster, but because of the documentation they provide. On a large-scale effort - with many developers, lots of data sources, and a lifespan that should be "long" (eg: maintaining this system is relevant!) - they make sense.


Comments

David Aldridge, April 10, 2006 - 5:21 pm UTC

A pure Informatica process is mostly infeasible -- you need to call stored procedures to disable and rebuild indexes, validate dimensions, and whatnot.

The one problem that irritates me most with Informatica is its insistence that 2.5*1.3 = 3.24999999... (for example), which can ruin your day when you need to compare three data values to see if a*b = c. In those cases I've always reverted to embedding the logic in a decode statement as part of a view, or in a SQL override on the source qualifier. Once you start doing that sort of thing it's a slippery slope...
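For illustration, that workaround might look something like this - a minimal sketch only, with invented table and column names:

create or replace view invoice_lines_checked as
select invoice_id,
       qty * unit_price as computed_total,
       line_total,
       -- compare inside Oracle, where NUMBER arithmetic is exact decimal
       -- (2.5 * 1.3 really is 3.25 here); rounding to the stored scale
       -- adds an extra safety margin against any float-sourced values
       decode( round(qty * unit_price, 2),
               round(line_total, 2), 'OK',
               'MISMATCH' ) as price_check
  from invoice_lines;

The mapping then reads the view (or the same expression goes into the source qualifier's SQL override) instead of re-doing the multiplication in the engine's floating point.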


a*b=c

Kirtan, March 28, 2007 - 3:33 pm UTC

I stumbled across this question while browsing through asktom.
David,
Just an FYI for the future: you could turn high precision on in Informatica to do that sort of thing. I have been using Informatica for a while now and I find the combination of PL/SQL and Informatica running side by side quite productive.

Many Instances of Wrong

A reader, August 14, 2007 - 9:17 am UTC

I am frustrated with my company's insistence on using Informatica for simple data movements. For example, we are pulling data from a remote database via a single SQL query and a DB LINK. There are no calculations being performed, and there are very few rows being returned. In short, there's no ETL at all - just a remote DB query.

As we were having our production meeting, the suggestion was made that we change those queries to use Informatica because "we don't want hits on the database". Which, to me, sounds nonsensical, since Informatica has to hit the database as well.

It also sounds to me as though they are trying to create work, since it takes several weeks to productionize even the simplest workflows.

A database link takes 5 minutes to create and use. An Informatica workflow takes several weeks.

There's no comparison in that case.

To me, if the purpose is real E-T-L, then Informatica is the right tool. If you're trying to create work, then go to a different company...

thoughts?
Tom Kyte
August 20, 2007 - 11:04 am UTC

my thoughts....


I don't get their argument about "we don't want hits on the database". For your stated reason (informatica will actually HIT THE DATABASE HARDER than your current database link + single sql statement), as well as the fact that - well, why did you buy a database if you are afraid to use it?
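For illustration, the entire "workflow" being replaced here can be a single statement over the link - a minimal sketch, with a hypothetical link name and table:

insert into local_orders (order_id, customer_id, order_date)
select order_id, customer_id, order_date
  from orders@remote_site          -- remote_site is the database link
 where order_date >= trunc(sysdate);

Five minutes to create the link, one statement to move the rows - and exactly one "hit" on the remote database.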

Need an expert review of a statement from the book "The Data Warehouse ETL Toolkit"

Giri, June 12, 2008 - 2:52 am UTC

Hi Tom,

In this context of PL/SQL vs ETL, here is a statement from the book "The Data Warehouse ETL Toolkit" by Ralph Kimball (Page 300, Chapter 7, Published 2004, ISBN 0-764-56757-8):

"ETL Products are specifically designed to extract, transform, and load massive amounts like no other nondata wareousing solution. With minimal exception, most databases are designed to support transactional and operational applications. Database-procedural programming is good for supporting data-entry applications but is not optimal for processing large data sets at once. The use of cursors-where each record is analyzed individually before moving on to the next-is notoriously slow and usually results in unacceptable prefromance while processing very large data sets. Instead of using procedures stored within the databses, it's beneficial to utilize the ETL engine for manipulating and managing the data".
If i read these sentences carefully, i have following observations.

1) He seems to mean incorrectly implemented cursors (row-by-row, slow-by-slow processing), not correct implementations like ref cursors and bulk fetch/collect.

2) Some of the statements in this paragraph need to be taken with a grain of salt; it all boils down to "it depends" and needs to be benchmarked.

3) Somehow I don't agree with this statement. Kimball probably went a little overboard against the database and is biased towards ETL tools.

Database-procedural programming is good for supporting data-entry applications but is not optimal for processing large data sets at once.

4) Don't believe everything written in books, even by gurus; some of the statements probably apply only to very specific scenarios, most of the time incorrectly implemented ones.

It would be great if you could provide your expert observations on ETL tools.
Tom Kyte
June 12, 2008 - 8:05 am UTC

1) he means "non-bulk things", yes. If you do all of your ETL slow by slow (which, in my experience, is what most of these external ETL engines do!!!! They are notorious for slow-by-slow processing - they are not designed "in bulk" as he suggests; they should be, but are not), it'll be very slow.

If you can do it in a single sql statement - there you go, best ever. That is the MOST optimal way to process large data sets (never let it OUT of the database). "ETL engines" tend to a) extract (hence the E), b) write to files on disk, c) transform (hence the T), d) write to files on disk again, and then e) load (hence the L).

I'd rather just T, and skip E and L (see the sketch at the end of this answer).

2) correct, as with everything here

3) I concur. There is a bit of database independence going on there as well - which, if you get caught up in it, means you'll end up doing things in a manner that gets poor performance on one database (the one you initially develop on) and horrible performance on the rest (those to which you "port").

PLSQL is just a language, SQL is a language; Java, C, Fortran - they are all languages. The ETL engine is written in *some* language, and that language is ultimately comparable to PLSQL as far as programming constructs go. You can write bad plsql code, you can write bad C code. It is in the implementation - your algorithms.

Look at the ETL engine that is OWB (Oracle Warehouse Builder). It generates... PLSQL for everything (and it skips the E and L steps whenever possible). It has the ability to generate slow-by-slow, bulk-processing, or single-SQL-statement implementations - you choose, for the most part.

A language should not be damned for what it is generally used for.

4) perfect.
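To make the contrast in 1) and 3) concrete, here is a minimal sketch with invented staging and target tables. Both versions perform the same transformation; the second never leaves the SQL engine:

-- slow by slow: one execution per row, the style being criticized above
begin
  for r in (select invoice_id, amount, currency from stg_invoices) loop
    insert into dw_invoices (invoice_id, amount_usd)
    values (r.invoice_id,
            case r.currency when 'USD' then r.amount
                 else r.amount * 1.1 end);   -- 1.1 is a placeholder rate
  end loop;
  commit;
end;
/

-- just the T: one set-based, direct-path statement
insert /*+ APPEND */ into dw_invoices (invoice_id, amount_usd)
select invoice_id,
       case currency when 'USD' then amount else amount * 1.1 end
  from stg_invoices;
commit;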

There should be some kind of framework for ETL

Reader, July 28, 2008 - 6:58 am UTC

I have used Informatica, and when it came to performance we had a hard time getting the data in on time, before business hours.

I am not sure if the latest version has improved; I am not using it anymore.

I think if we have the data in the Oracle DB itself, we can very well use SQL and PL/SQL, which might give better performance.

For the above solution to be more productive (in terms of development time), I think there should be some kind of framework (built into Oracle, or on top of it by a 3rd party) using PL/SQL which can take care of the common tasks involved in ETL, just like Java has so many frameworks for specific areas. I am not sure if something similar already exists.

It should just provide a common framework which can be easily used and extended. It should not end up doing everything and become an ETL tool :) For example, a single framework "step" might be little more than a logged wrapper around dynamic SQL, as in the sketch below.
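A hypothetical sketch of such a framework "step" - every name here is invented:

create table etl_log (
  step_name  varchar2(100),
  started    timestamp,
  ended      timestamp,
  rows_done  number,
  status     varchar2(200)
);

create or replace procedure run_step( p_name in varchar2, p_sql in varchar2 )
as
  l_start timestamp := systimestamp;
  l_rows  number;
begin
  -- each ETL step is one set-based statement, executed and logged uniformly
  execute immediate p_sql;
  l_rows := sql%rowcount;
  insert into etl_log values (p_name, l_start, systimestamp, l_rows, 'OK');
  commit;
exception
  when others then
    insert into etl_log
    values (p_name, l_start, systimestamp, null, substr(sqlerrm, 1, 200));
    commit;
    raise;
end;
/

-- usage:
exec run_step('load_invoices', 'insert /*+ APPEND */ into dw_invoices select invoice_id, amount from stg_invoices')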

What do you think?

PL/SQL Vs ETL !!!

Ankit, January 20, 2012 - 12:38 am UTC

Hi Tom

It's wonderful to read your Oracle words of wisdom. Thanks for all that :))

I am currently working on a DWH application (built on Oracle 10g R2) which uses PL/SQL procedures to extract data via a DB link into intermediate STAGING tables, then uses PL/SQL procedures to transform it a bit and load it into the MAIN tables, mostly using MERGE statements. Currently the load takes at most 3.5 hrs to finish. Indexes and hints are in place, and a job runs weekly to collect statistics on the schema. There are around 550 jobs and 450 tables to be filled. The guy who developed it said, "what's the need to use some ETL tool when everything in the end has to get converted into PL/SQL anyway?"

I need to propose value-adding improvements for this application.
I want to know :

1) Would it be beneficial to use some ETL tool instead of the approach given above for this application? If yes, what benefits do ETL tools bring? Which tool would you propose, OWB or Informatica?
2) How true is the statement "what's the need to use some ETL tool when everything in the end has to get converted into PL/SQL anyway"?
3) Would you suggest anything other than MERGE statements to improve efficiency?

Any other suggestion from you will be highly appreciated.

Thanks.
Tom Kyte
January 20, 2012 - 9:47 am UTC

Indexes and hints are in place,

that scares me. I don't like the idea of hints, and if your ETL uses a ton of indexes, you must be doing slow-by-slow processing :(

1) see the original answer above, starting with the "to me - the primary reason..".

2) I would say it is not entirely accurate. I would like it to be converted into SQL only with a tiny bit of PLSQL if necessary.

3) avoid conventional path inserts (insert without append), avoid update, avoid delete, avoid merge. Use create table as select (CTAS) or insert /*+ APPEND */ to load massive amounts of information. Instead of an update or merge, do the operation as a create table as select and get rid of the old information (this can happen at the partition level or whatever).

insert (without append), update, delete, merge - these were invented mostly for OLTP, not for doing millions of rows.

CTAS and direct path loads (insert /*+ APPEND */) were invented for warehousing - they skip undo, can skip redo, and allow the resulting data to be compressed to the max.
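A minimal sketch of the "CTAS instead of MERGE" idea, with invented table names:

-- instead of:  merge into big_table using staged_rows ...
-- rebuild the result in one direct-path, set-based operation:
create table big_table_new
  nologging    -- can skip redo (subject to the database's force-logging setting)
  compress
as
select nvl(s.pk, b.pk) as pk,
       case when s.pk is not null then s.val else b.val end as val  -- new row wins
  from big_table b
  full outer join staged_rows s on (b.pk = s.pk);

-- then swap the tables (indexes, grants, etc. must be recreated);
-- the same trick works partition by partition via exchange partition
drop table big_table;
rename big_table_new to big_table;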


Difference between ETL & ELT

Waran, September 03, 2013 - 7:58 am UTC

What is the difference between ETL and ELT?
Tom Kyte
September 04, 2013 - 6:58 pm UTC

it is much harder and less efficient, in general, to do the T (transformation) OUTSIDE of the database. The database was born to transform data with big set-based direct path operations.


so extracting to disk, transforming on disk, and loading back into a database

is, in my experience, much more tedious and much slower than

extract and load into the database, then transform

or even

extract and (load & transform) into the database


http://www.youtube.com/watch?v=2jWq-VUeOGs
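A minimal sketch of "extract and (load & transform)" with an external table - the file, directory object, and table names are invented:

-- the extracted flat file becomes queryable in place; no separate load step
create table invoices_ext (
  invoice_id  number,
  amount      number,
  currency    varchar2(3)
)
organization external (
  type oracle_loader
  default directory etl_dir          -- directory object pointing at the file
  access parameters (
    records delimited by newline
    fields terminated by ','
  )
  location ('invoices.dat')
);

-- load and transform in a single set-based, direct-path statement
insert /*+ APPEND */ into dw_invoices (invoice_id, amount_usd)
select invoice_id,
       case currency when 'USD' then amount else amount * 1.1 end  -- placeholder rate
  from invoices_ext;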

get over the wrong concepts

Amir Riaz, September 08, 2013 - 7:57 am UTC

A lot of wrong concepts have been put forward in this thread.

In reality there is no pure ETL or ELT loading. A well-designed data load is a mixture of both; that is why tools like Informatica give you both.

With ELT loading, the biggest problem is transferring the data between heterogeneous systems, for example between Teradata and Oracle. While Oracle provides gateways and other techniques, most of them are time-consuming and unrealistic in nature.

Oracle's ELT tool, ODI, implements data movement between heterogeneous systems via JDBC or through file transfer. Neither is good, and they show the real weakness of that tool.

Oracle, after so many years, is still not able to understand the requirements of a true ETL or ELT tool, because most of its people are either DBAs or PL/SQL programmers.
Tom Kyte
September 10, 2013 - 9:40 am UTC

huh???


elt = extract, load and then transform (forget any tools - if you have hundreds of GBs or TBs of data, you'll be doing this down to the wire, not with pretty pictures and push buttons)

etl = extract, transform and then load - without using the database to transform


elt = going light speed
etl = going by boat


Amir Riaz, September 14, 2013 - 12:03 pm UTC

Dear Tom,

Yes, you are right, and I understand that, but did you ever evaluate ODI against some other tools? They have been trying to fix it for the last 6 years and are still nowhere.

OBIEE and Exadata are wonderful products, but once you look for an ELT tool from Oracle, ODI really is a poor choice, with poor interfaces and an illogical descriptive style.

Just describe how you would move data from Teradata to Oracle if you cannot use Oracle gateways, the data is huge, and it requires parallel processing.

A few weeks back I had this problem and the only solution was to use Oracle gateways. Imagine the consequences if, for everything, you have to go to the DBA.

ETL

Dragutin Jastrebic from France, September 15, 2013 - 7:16 pm UTC

I have worked on several data warehouse projects, and we have always used the staging area (extract, transform) inside the database.
But there are many different situations.
For example, in his book "The Data Warehouse ETL Toolkit" (Chapter 2: ETL data structures), Ralph Kimball gives arguments in favor of database tables but also arguments in favor of flat files.

Dragutin

ETL

David Aldridge, September 17, 2013 - 6:56 am UTC

Just as background information: the last time I worked with Informatica PowerCenter, about 5 years ago, there was an option to push code execution down to the database rather than running it through the PowerCenter server. So, if you had a simple transformation within a single database, it could be rewritten as a SQL statement (an update, for example) and executed on the database server.

This was a performance enhancement for which an extra license fee was payable.

So, just to clarify: if you had paid for PowerCenter but still needed higher performance, you could pay extra to not use PowerCenter.

Alexander, April 21, 2014 - 4:29 pm UTC

Hi Tom.

I have a strange problem that I'm hoping you may have some ideas about. We use an Oracle product called OBIEE. It has an ETL component, and recently the support team had a problem, in both production and test, where a query with no where clause was not returning all the rows it should have. For example, the query was supposed to return about 100k rows from a table but was only getting about 80-90k. For some reason, dropping and recreating the table resolved the issue, and afterwards it returned all the rows.

Can you think of anything that could possibly cause this? I don't even know where to begin; it's such a strange problem. I'm trying to rule out the database as a cause for the post-mortem analysis.

This is the query just so you can see it:

SELECT EMPLOYEE_NUMBER, 
TASK_NAME, 
RESULTS_ACHIEVED, 
POTENTIAL, 
RESULTS_POTENTIAL_RAW, 
RESULTS_POTENTIAL, 
READINESS_2011, 
READINESS, 
IC_READY_TO_BE_AN_FLM, 
PARA_READY_TO_BE_A_SUPERVISOR, 
ARE_YOU_WILLING_TO_RELOCATE, 
RETENTION_RISK, 
IMPACT_OF_LOSS, 
POTENTIAL_NEXT_ROLES, 
READY_FOR_IMMED_MOVE_FLG 
FROM 
WC_ELC_TALENT_MSTR_ALL