Connecting to CDH5 in an EC2 instance
Dear all,
I have recently launched an EC2 instance running CDH 5.11. All services seem to be up and running, and I have run several tests to validate the installation.
I have also installed RapidMiner Studio on my desktop, as well as the Radoop extension. Currently, I am trying to connect to my Hadoop cluster. The EC2 instance is not configured to use Elastic IPs, so I am using tunnels through an SSH session.
I am currently trying to pass the full test to validate the connection. Initially, the configuration was imported from Cloudera Manager. Then I modified several properties to adjust it to my environment. The Hive, Java version, MapReduce and NameNode networking tests pass, but I am stuck at the upload of a jar file to HDFS. I guess the problem is related to an earlier warning from the DataNode networking test:
WARNING: Reverse DNS lookup failed! Expected hostname for ip
WARNING: DataNode port 50010 on the ip/hostname
I guess that the tunnel on port 50010 is working fine, but there is something I am missing. The output of netstat on the server shows this port is listening on all IPs (0.0.0.0).
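A quick way to test the guess about the tunnel from the desktop side is to try opening a TCP connection to the forwarded port. The sketch below assumes the tunnel forwards local port 50010 on 127.0.0.1; the function name is made up for illustration. Note that a successful connection only shows that something accepts connections on that port. It does not show that HDFS writes will work, because the client connects to whatever DataNode address the NameNode reports.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Assumption: the SSH tunnel forwards local port 50010 to the DataNode.
    print(port_accepts_connections("127.0.0.1", 50010))
```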
Things I have tried:
- Edited my local hosts file to resolve the public IP to the internal server hostname. Radoop then complains that the server is unreachable.
- Formatted the NameNode, after deleting all data in the HDFS data directory.
- Set dfs.client.use.datanode.hostname and dfs.datanode.use.datanode.hostname to true in the client configuration.
- Tried to upload a file using another client, such as Toad. Same error.
- Tried to set dfs.datanode.address on the server to hostname:port. Cloudera Manager does not allow this; only the port number can be set.
- Editing dfs.datanode.address in the client-side configuration does not change Radoop's behaviour.
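For reference, the two hostname-related properties from the list can be written as a Hadoop-style configuration fragment. This is only a sketch of the standard property XML layout; the helper name is hypothetical, and whether these overrides help in a tunnelled setup is exactly what this thread is trying to find out.

```python
import xml.etree.ElementTree as ET


def hadoop_properties_xml(props: dict) -> str:
    """Render name/value pairs in the <configuration><property> layout used by Hadoop."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# The two properties mentioned in the list above, set on the client side.
CLIENT_OVERRIDES = {
    "dfs.client.use.datanode.hostname": "true",
    "dfs.datanode.use.datanode.hostname": "true",
}

print(hadoop_properties_xml(CLIENT_OVERRIDES))
```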
The error when trying to upload the jar file is the following:
[----] SEVERE: File /tmp/radoop/_shared/db_default/radoop_hive-v4_UPLOADING_1498636293395_dy8gaul.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
The client knows the number of DataNodes in the HDFS service. Does that mean the SSH tunnel on port 50010 is working fine? Can someone point me in the right direction?
Thank you!!
Best Answer
phellinger (Employee, Member, RM Engineering; Posts: 103)
Hi,
that is already some progress!
The client knows the number of DataNodes from the NameNode's response.
The client almost certainly won't be able to access the DataNodes directly, only through a SOCKS proxy, so the traffic goes through a master node.
You need to follow the instructions in "Configuring SOCKS Proxy and SSH tunneling" at
https://docs.www.turtlecreekpls.com/radoop/installation/networking-setup.html
In this case, you don't need to create tunnels one by one. Only one additional tunnel is needed, for Hive; see the description.
Or is it something you have already configured?
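As a rough illustration of what such a setup involves (the linked documentation is the authoritative source), the sketch below builds an ssh command that opens a dynamic SOCKS proxy with -D plus one local forward with -L for Hive, and lists the standard Hadoop client properties for routing traffic through a SOCKS server. The user name, host names and SOCKS port 1080 are placeholders, port 10000 is assumed as the HiveServer2 port, and the exact settings Radoop expects should be taken from the documentation rather than from this sketch.

```python
import shlex


def socks_proxy_command(user, gateway, socks_port=1080, hive_host=None, hive_port=10000):
    """Build an ssh command: -D opens a SOCKS proxy, -L adds one forward for Hive."""
    cmd = ["ssh", "-N", "-D", str(socks_port)]
    if hive_host is not None:
        cmd += ["-L", f"{hive_port}:{hive_host}:{hive_port}"]
    cmd.append(f"{user}@{gateway}")
    return cmd


def socks_client_properties(socks_port=1080):
    """Hadoop client properties that send RPC traffic through a local SOCKS proxy."""
    return {
        "hadoop.rpc.socket.factory.class.default": "org.apache.hadoop.net.SocksSocketFactory",
        "hadoop.socks.server": f"localhost:{socks_port}",
    }


# Hypothetical user and host names, for illustration only.
command = socks_proxy_command("ec2-user", "ec2-gateway.example.com",
                              hive_host="hive-host.internal")
print(" ".join(shlex.quote(part) for part in command))
for name, value in socks_client_properties().items():
    print(f"{name} = {value}")
```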
This thread may also be helpful.
Best,
Peter
Answers
Hi phellinger,
Thanks a lot, this was helpful. I had not read this documentation and was trying to set up a thousand tunnels.
I am now able to pass the quick test. The full test fails at the Hive table load. The error tells me to check user permissions for LOAD or CREATE statements, which I have already done, and they seem to be OK.
Can you point me in the right direction?
Thank you in advance!
Best,
Pau
Hi Pau,
great!
The Hive load test uploads an HDFS file to a temp dir, and uses the LOAD DATA Hive statement that will effectively move the file to the Hive warehouse directory.
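To make the second step concrete, here is a small sketch that builds a LOAD DATA statement of the kind described above. The path and table name are invented for illustration and are not what Radoop actually uses. Because LOAD DATA INPATH moves the file rather than copying it, the user running the statement likely needs write access both to the temp location and to the warehouse directory, which may be worth checking when the error mentions permissions.

```python
def load_data_statement(hdfs_path: str, table: str, overwrite: bool = False) -> str:
    """Build a HiveQL LOAD DATA statement for a file that is already in HDFS."""
    if "'" in hdfs_path:
        # Keep the sketch simple: refuse paths that would need quoting rules.
        raise ValueError("single quotes in the path are not supported by this sketch")
    mode = "OVERWRITE " if overwrite else ""
    return f"LOAD DATA INPATH '{hdfs_path}' {mode}INTO TABLE {table}"


# Hypothetical temp file and table name.
print(load_data_statement("/tmp/radoop/test.csv", "radoop_test"))
```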
If you enable the Log panel in Studio (View -> Show Panel -> Log) and set the log level (right click on the panel -> Set log level -> FINER), you will see the details.
Can you share more details (log) in PM or here?
Best,
Peter