I am trying to bulk load a CSV file into HBase using the importtsv
and LoadIncrementalHFiles
tools that ship with Apache HBase.
Tutorials can be found at these pages: cloudera, apache.
I am using Apache Hadoop and HBase.
Both sources explain how to use these tools from the command prompt. However, I want to get this done from Java code. I know I can write a custom MapReduce job as explained on the Cloudera page, but I want to know whether I can use the classes corresponding to these tools directly in my Java code.
My cluster runs in pseudo-distributed mode on an Ubuntu VM inside VMware, while my Java code runs on the Windows host machine. When doing this from the command prompt on the machine running the cluster, I run the following command:
$HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-1.2.1.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir datatsv hdfs://192.168.23.128:9000/bulkloadinputdir/
As can be seen above, we set HADOOP_CLASSPATH
. In my case, I guess I have to copy all the xyz-site.xml
Hadoop configuration files to my Windows machine and set the HADOOP_CLASSPATH
environment variable to the directory containing them. So I copied core-site.xml, hbase-site.xml, hdfs-site.xml
to my Windows machine and set that directory as the Windows environment variable HADOOP_CLASSPATH
. Apart from these, I also added all the required JARs to the Eclipse project's build path.
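For what it's worth, my understanding of what that classpath setup is supposed to achieve in code is roughly the sketch below (the D:/hadoop-conf directory is just where I put the copies, and the explicit addResource() calls are my own assumption, not something either tutorial shows):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ConfigSanityCheck {
    public static void main(String[] args) {
        // HBaseConfiguration.create() only picks up hbase-site.xml (and the Hadoop
        // *-site.xml files) if the directory containing them is on the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Fallback I am considering: add the copied files explicitly.
        // D:/hadoop-conf is simply where I placed the copies on Windows (assumed path).
        conf.addResource(new Path("file:///D:/hadoop-conf/core-site.xml"));
        conf.addResource(new Path("file:///D:/hadoop-conf/hdfs-site.xml"));
        conf.addResource(new Path("file:///D:/hadoop-conf/hbase-site.xml"));

        // Quick check that the cluster settings were actually read.
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
        System.out.println("hbase.zookeeper.quorum = " + conf.get("hbase.zookeeper.quorum"));
    }
}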
But after running the project I got the following error:
Exception in thread "main" org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the locations
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:319)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:326)
at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:301)
at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:166)
at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:161)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
at org.apache.hadoop.hbase.MetaTableAccessor.fullScan(MetaTableAccessor.java:602)
at org.apache.hadoop.hbase.MetaTableAccessor.tableExists(MetaTableAccessor.java:366)
at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:403)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.createSubmittableJob(ImportTsv.java:493)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.run(ImportTsv.java:737)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.hbase.mapreduce.ImportTsv.main(ImportTsv.java:747)
at HBaseImportTsvBulkLoader.createStoreFilesFromHdfsFiles(HBaseImportTsvBulkLoader.java:36)
at HBaseImportTsvBulkLoader.main(HBaseImportTsvBulkLoader.java:17)
So somehow importtsv
is still not able to find the location of the cluster.
This is what my basic code looks like:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class HBaseImportTsvBulkLoader {
    static Configuration config;

    public static void main(String[] args) throws Exception {
        config = new Configuration();
        copyFileToHDFS();
        createStoreFilesFromHdfsFiles();
        loadStoreFilesToTable();
    }

    // Copy the local input file into HDFS.
    private static void copyFileToHDFS() throws IOException {
        config.set("fs.defaultFS", "hdfs://192.168.23.128:9000");
        FileSystem hdfs = FileSystem.get(config);
        Path localfsSourceDir = new Path("D:\\delete\\bulkloadinputfile1");
        Path hdfsTargetDir = new Path(hdfs.getWorkingDirectory() + "/");
        hdfs.copyFromLocalFile(localfsSourceDir, hdfsTargetDir);
    }

    // Run ImportTsv to generate HFiles in the bulk output directory.
    private static void createStoreFilesFromHdfsFiles() throws Exception {
        String[] _args = {"-Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir",
                "-Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2",
                "datatsv",
                "hdfs://192.168.23.128:9000/bulkloadinputdir/"};
        ImportTsv.main(_args); // **throws exception**
    }

    // Load the generated HFiles into the table.
    private static void loadStoreFilesToTable() throws Exception {
        String[] _args = {"hdfs://192.168.23.128:9000/bulkloadoutputdir", "datatsv"};
        LoadIncrementalHFiles.main(_args);
    }
}
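As an aside, for the last step I was wondering whether I could avoid LoadIncrementalHFiles.main() altogether and call the programmatic API instead. Something like the sketch below is what I had in mind (based on my reading of the 1.2.x javadoc for doBulkLoad(); I haven't been able to try it yet because the ImportTsv step fails first, so the exact signature is my assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class ProgrammaticBulkLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("datatsv");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName)) {
            // Point the loader at the HFiles produced in ImportTsv's bulk.output directory.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path("hdfs://192.168.23.128:9000/bulkloadoutputdir"),
                    admin, table, locator);
        }
    }
}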
Questions
Which xyz-site.xml
files are required?
How should I specify HADOOP_CLASSPATH
?
Can I pass the required arguments, such as -Dhbase.rootdir
, to the main()
method of ImportTsv
as shown below?
String[] _args = {"-Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir",
"-Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2",
"-Dhbase.rootdir=hdfs://192.168.23.128:9000/hbase",
"datatsv",
"hdfs://192.168.23.128:9000/bulkloadinputdir/"};
Can I use ImportTsv.setConf()
to set the same?
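To make the last two questions concrete, what I had in mind is something along the following lines, i.e. building a Configuration myself and handing it to the tools through ToolRunner instead of calling their main() methods (only a sketch: the ZooKeeper quorum/port values are my guess at what the tools need to locate the cluster, and I also suspect the main() methods call System.exit(), which would stop my loadStoreFilesToTable() step from ever running):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.ImportTsv;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.util.ToolRunner;

public class BulkLoadViaToolRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Cluster location settings I assume are needed; values match my pseudo-distributed VM.
        conf.set("fs.defaultFS", "hdfs://192.168.23.128:9000");
        conf.set("hbase.zookeeper.quorum", "192.168.23.128");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        String[] importArgs = {
                "-Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2",
                "-Dimporttsv.bulk.output=hdfs://192.168.23.128:9000/bulkloadoutputdir",
                "datatsv",
                "hdfs://192.168.23.128:9000/bulkloadinputdir/"};
        // ToolRunner parses the -D options and calls setConf() on the tool,
        // so both the explicit conf above and the -D arguments should be honored.
        int rc = ToolRunner.run(conf, new ImportTsv(), importArgs);
        if (rc != 0) {
            throw new IllegalStateException("ImportTsv failed with exit code " + rc);
        }

        String[] loadArgs = {
                "hdfs://192.168.23.128:9000/bulkloadoutputdir",
                "datatsv"};
        ToolRunner.run(conf, new LoadIncrementalHFiles(conf), loadArgs);
    }
}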