Extract data from a table to Hadoop File System


Question and Answer

Connor McDonald

Thanks for the question, shiva.

Asked: January 20, 2017 - 11:26 pm UTC

Last updated: January 21, 2017 - 3:52 am UTC

Version: 12.1

Viewed 1000+ times

You Asked

Team - What are the best options to extract data from an Oracle table (which could have columns such as CLOBs) to an HDFS location?

Currently we are trying "sqoop", but it is pretty slow and it intermittently fails with a "Connection reset" error (we are using a JDBC connection). Here is an example of what our sqoop statement looks like:

export CLASSPATH=$CLASSPATH:/data/ggadmin/dev/ojdbc6.jar
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/data/ggadmin/dev/ojdbc6.jar
export JAVA_HOME=/data/ggadmin/ggs/client/11.2.0.4/jdk
export PATH=$PATH:$JAVA_HOME/bin

sqoop import \
--connect jdbc:oracle:thin:SERVICE_GG/*****@****:1521/*** \
--table GPD_HUB.EFF_PT_CD_VER -m 4 \
--target-dir /user/mwzh43v/GPD/GPD_HUB/EFF_PT_CD_VER_hist \
--direct \
--null-string '' \
--null-non-string '' \
--fields-terminated-by '\034'

We get a whole bunch of errors like "Caused by: java.sql.SQLRecoverableException: IO Error: Connection reset"
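
For what it's worth, a frequently reported cause of that particular "IO Error: Connection reset" from the Oracle thin JDBC driver on Linux is the JVM blocking on /dev/random while seeding SecureRandom. A minimal sketch of the commonly suggested workaround, assuming that is what is happening here, is to point the launcher JVM and the map-task JVMs at /dev/urandom:

# same import as above, with the JVMs pointed at /dev/urandom instead of /dev/random
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/../dev/urandom"

sqoop import \
-D mapreduce.map.java.opts="-Djava.security.egd=file:/dev/../dev/urandom" \
--connect jdbc:oracle:thin:SERVICE_GG/*****@****:1521/*** \
--table GPD_HUB.EFF_PT_CD_VER -m 4 \
--target-dir /user/mwzh43v/GPD/GPD_HUB/EFF_PT_CD_VER_hist \
--direct \
--null-string '' \
--null-non-string '' \
--fields-terminated-by '\034'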

Are there any other options besides this?

Thanks,




and Connor said...

We have a product for that:

https://docs.oracle.com/bigdata/bda45/BDSUG/copy2bda.htm#BIGUG76736

and there are other third-party alternatives out there as well (e.g. Gluent).

There are also some good notes about getting the best out of Sqoop in Tanel Poder's presentation below:

http://www.ooug.org/wp-content/uploads/2016/05/Tanel_Poder_Connecting_Hadoop_and_Oracle.pdf
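
For illustration only, and not as a summary of those slides, the sqoop-side tuning usually comes down to a handful of options: how the work is split across mappers, how many mappers run, and how many rows each JDBC fetch pulls back. A rough sketch, assuming a reasonably evenly distributed numeric key column (hypothetically called ROW_ID here) to split on:

# hypothetical values -- split column, mapper count and fetch size all depend
# on the table and on how much load the source database can tolerate
sqoop import \
--connect jdbc:oracle:thin:SERVICE_GG/*****@****:1521/*** \
--table GPD_HUB.EFF_PT_CD_VER \
--split-by ROW_ID \
--num-mappers 8 \
--fetch-size 10000 \
--direct \
--target-dir /user/mwzh43v/GPD/GPD_HUB/EFF_PT_CD_VER_hist \
--fields-terminated-by '\034'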



Comments

follow up,

shiva nagaraj, January 21, 2017 - 6:59 pm UTC

Thanks for sharing the articles. The first option seems good, and the sqoop example you have provided is probably what we are already trying.

Our biggest challenge is bringing data from hundreds of Oracle databases into HDFS. We have a goal of getting 25 per week, and the databases range from a few hundred gigabytes to 4-5 TB in size. We prefer an automated approach, and sqoop so far fits this, but our biggest problem is the "connection reset" error from JDBC and having to rerun all the tables that hit it.

Thanks,
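
For the rerun problem specifically, a minimal sketch of a retry wrapper, assuming a plain-text list of OWNER.TABLE_NAME values in tables.txt and the same connection details as in the question (both are placeholders):

#!/bin/bash
# hypothetical wrapper: retries each failed sqoop import instead of rerunning by hand
TABLE_LIST=tables.txt
MAX_RETRIES=3

for TABLE in $(cat "$TABLE_LIST"); do
  attempt=1
  until sqoop import \
        --connect jdbc:oracle:thin:SERVICE_GG/*****@****:1521/*** \
        --table "$TABLE" -m 4 \
        --target-dir "/user/mwzh43v/GPD/$TABLE" \
        --delete-target-dir \
        --fields-terminated-by '\034'
  do
    if [ "$attempt" -ge "$MAX_RETRIES" ]; then
      echo "$TABLE still failing after $MAX_RETRIES attempts" >> failed_tables.txt
      break
    fi
    attempt=$((attempt + 1))
    echo "retrying $TABLE (attempt $attempt)"
  done
done

Tables logged to failed_tables.txt can then be picked up by a later pass or investigated by hand.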
