Tutorial: http://www.shouce.ren/api/view/a/11357
Command manual: https://hadoop.apache.org/docs/r1.0.4/cn/commands_manual.html
Common HDFS operations – a summary of common file-operation commands: http://www.aboutyun.com/blog-4073-518.html
Common Hadoop HDFS file-operation commands: https://segmentfault.com/a/1190000002672666
Installation
Set the IP address:
vim /etc/sysconfig/network-scripts/ifcfg-eth0
# edit the IP, save, and reboot
Servers:
192.168.0.51  H51  ---> ambari-server
192.168.0.52  H52  ---> hadoop
192.168.0.53  H53  ---> hadoop
192.168.0.54  H54  ---> hadoop
Set up passwordless SSH from the local machine to the VMs:
ssh-keygen -t rsa
ssh-copy-id root@192.168.0.51
ssh-copy-id root@192.168.0.52
ssh-copy-id root@192.168.0.53
ssh-copy-id root@192.168.0.54
Log in to server 51 and set up passwordless SSH from VM 51 to the other servers:
ssh-keygen -t rsa
cd /root/.ssh
cat id_rsa.pub >> authorized_keys
ssh-copy-id root@192.168.0.52
ssh-copy-id root@192.168.0.53
ssh-copy-id root@192.168.0.54
Fetch the SSH key files from server 51:
scp root@192.168.0.51:/root/.ssh/* /mnt/E/4_开发软件/Hadoop/ssh5/
Install basic tools on all servers:
yum -y install vim wget ntp
Disable the firewall on all servers:
# stop iptables
service iptables stop
chkconfig iptables off
service iptables status
# disable selinux
vim /etc/selinux/config
SELINUX=disabled
Disable IPv6:
vim /etc/sysctl.conf
# then append the following to disable IPv6 on all interfaces system-wide:
net.ipv6.conf.all.disable_ipv6 = 1
Append the hosts configuration on every system:
vim /etc/hosts
# to disable IPv6, delete the "::1 localhost" line from /etc/hosts and reboot
# then add:
192.168.0.51 H1
192.168.0.52 H2
192.168.0.53 H3
192.168.0.54 H4
Configure clock synchronization (NTP):
vim /etc/ntp.conf

# =============================== server 51
# next to the sample line "#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap", add:
restrict 192.168.0.0 mask 255.255.255.0 nomodify notrap

# start the service
service ntpd start
# start on boot
chkconfig ntpd on
# check status
service ntpd status

# =============================== the other servers
# disable the default internet "server" entries and add:
server 192.168.0.51
Reboot all systems, then test clock synchronization. More about NTP: http://cn.linux.vbird.org/linux_server/0440ntp.php
# test from clients 52, 53 and 54
ntpdate 192.168.0.51
# output like the following means success; if it fails, restart the ntp
# service on server 51 a few times: service ntpd restart
4 Nov 09:21:22 ntpdate[1222]: adjust time server 192.168.0.51 offset -0.201301 sec

# check the sync status on the client machines
# start the service
service ntpd start
# start on boot
chkconfig ntpd on
# check status
ntpstat
# output like the following means success
synchronised to NTP server (192.168.0.51) at stratum 4
   time correct to within 8100 ms
   polling server every 64 s
1: Create the local repository
# on server 51
yum -y install httpd createrepo yum-utils
service httpd start
chkconfig httpd on
Test: if http://192.168.0.51/ opens in a browser, httpd is working.
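The same check can be scripted; a minimal sketch, assuming curl is installed on the machine doing the checking:

curl -I http://192.168.0.51/
# any HTTP response here (200, or 403 from the default Apache welcome page) means httpd is answering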
# create the directory
mkdir -p /var/www/html/hdp
Upload files from the local machine to server 51, which doubles as the local repository and ambari-server host.
# upload from the physical machine
scp ambari-2.2.1.0-centos6.tar.gz root@192.168.0.51:/var/www/html/hdp
scp HDP-2.3.0.0-centos6-rpm.tar.gz root@192.168.0.51:/var/www/html/hdp
scp HDP-UTILS-1.1.0.20-centos6.tar.gz root@192.168.0.51:/var/www/html/hdp
scp jdk-7u67-linux-x64.rpm root@192.168.0.51:/var/www/html/hdp

# on server 51:
# install the JDK
cd /var/www/html/hdp
rpm -ivh jdk-7u67-linux-x64.rpm
# test
java -version
# output like the following means success:
java version "1.7.0_67"
Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

# extract all the uploaded tar.gz files
cd /var/www/html/hdp
tar zxvf ambari-2.2.1.0-centos6.tar.gz
tar zxvf HDP-2.3.0.0-centos6-rpm.tar.gz
tar zxvf HDP-UTILS-1.1.0.20-centos6.tar.gz
Create the local repository:
# create the repo; in the html directory (/var/www/html), run:
createrepo hdp
After this completes, the hdp directory contains a new repodata subdirectory.
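A quick way to confirm the metadata was actually generated (a sketch, using the paths from above):

ls /var/www/html/hdp/repodata
# createrepo writes repomd.xml plus the metadata files it indexes;
# if repomd.xml is there, yum clients can use the repo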
2: Install Ambari Server
Prepare the repo files for the local repository: create them on the local machine first, then copy them to every hadoop and ambari-server host.
cd /etc/yum.repos.d
vim hdp.repo

[HDP-2.3.0.0]
name=HDPVersion-HDP-2.3.0.0
baseurl=http://192.168.0.51/hdp/HDP/centos6/2.x/updates/2.3.0.0
gpgcheck=1
#gpgkey=http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
gpgkey=http://192.168.0.51/hdp/HDP/centos6/2.x/updates/2.3.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

vim ambari.repo

[Updates-ambari-2.2.1.0]
name=ambari-2.2.1.0-Updates
baseurl=http://192.168.0.51/hdp/AMBARI-2.2.1.0/centos6/2.2.1.0-161
gpgcheck=1
#gpgkey=http://public-repo-1.hortonworks.com/ambari/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
gpgkey=http://192.168.0.51/hdp/AMBARI-2.2.1.0/centos6/2.2.1.0-161/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

vim hdp-util.repo

[HDP-UTILS-1.1.0.20]
name=HDPUtilsVersion-HDP-UTILS-1.1.0.20
baseurl=http://192.168.0.51/hdp/HDP-UTILS-1.1.0.20/repos/centos6
gpgcheck=1
#gpgkey=http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.3.0.0/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
gpgkey=http://192.168.0.51/hdp/HDP-UTILS-1.1.0.20/repos/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
enabled=1
priority=1

# then copy the repo files to all VM servers
scp *.repo root@192.168.0.51:/etc/yum.repos.d
scp *.repo root@192.168.0.52:/etc/yum.repos.d
scp *.repo root@192.168.0.53:/etc/yum.repos.d
scp *.repo root@192.168.0.54:/etc/yum.repos.d
Upload the required files to server 51:
# on server 51, create the directory
mkdir -p /var/lib/ambari-server/resources/
# upload from the local machine to server 51
scp jdk-7u67-linux-x64.tar.gz root@192.168.0.51:/var/lib/ambari-server/resources/
scp UnlimitedJCEPolicyJDK7.zip root@192.168.0.51:/var/lib/ambari-server/resources/
# on server 51, install (online install)
yum install ambari-server
# first-time setup: choose JDK 7 and the PostgreSQL database; use "ambari" for the account, password, etc.
ambari-server setup
# start it
ambari-server start
3: Create the cluster
A: In the Select Stack step, choose version 2.3. Under Advanced Repository Options, select redhat6 and change the two URLs to:
http://192.168.0.51/hdp/HDP/centos6/2.x/updates/2.3.0.0
http://192.168.0.51/hdp/HDP-UTILS-1.1.0.20/repos/centos6
Also check Skip Repository Base URL validation (Advanced).
In the next step, configure the Target Hosts: enter H2, H3, H4 (one per line), upload the id_rsa private key obtained from server 51 when setting up SSH, then start Confirm Hosts. On success, the Status column in the table turns green.
B: Watch for warnings, then proceed to the next step.
C: Choose the machines for the cluster. Should this be three servers or four? Choose all four.
D: In Choose Services, select:
Basics: HDFS, YARN + MapReduce2, ZooKeeper, Ambari Metrics, Spark
Hive stack: Tez, Hive, Pig
E: Assign Masters: configure how the cluster's master components are laid out across the servers.
F: In Assign Slaves and Clients, make 52 the NodeManager and check DataNode and Client on the other servers.
G: Customize Services: you are prompted for the Hive database password here.
H: Finally, Deploy. If no errors occur, the installation succeeded.
4: Post-install configuration changes:
A: Change the server character encoding to en_US.UTF-8, preferably on all servers.
vim /etc/sysconfig/i18n
# set the encoding to en_US.UTF-8
source /etc/sysconfig/i18n
B: Change a few configuration properties (a command-line sketch follows the list):
* In Ambari Web, browse to Services > YARN > Configs, filter for the yarn.timeline-service.store-class property, and set it to org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore. Save the configuration change and restart YARN.
* In Ambari Web, browse to Services > HDFS > Configs, filter for the dfs.permissions property, and set it to false. Save the configuration change and restart HDFS.
* In Tez > Configs > Advanced tez-site, locate the tez.history.logging.service.class property and replace the value org.apache.tez.dag.history.logging.ats.ATSV15HistoryLoggingService with org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.
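These changes can also be made from the command line. A sketch, assuming the default admin/admin credentials, a cluster named MyCluster (hypothetical; use whatever you named yours in the wizard), and that your Ambari version ships the configs.sh helper at this path:

# set yarn.timeline-service.store-class in yarn-site through the Ambari API helper script
/var/lib/ambari-server/resources/scripts/configs.sh -u admin -p admin set localhost MyCluster yarn-site yarn.timeline-service.store-class org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore
# the affected service still needs a restart afterwards, as with the Web UI route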
C: Restart all services.
Testing:
1. Install the Eclipse plugin. Download the plugin and place it in the plugins folder, download Hadoop 2.7, and extract it to some path. When configuring Eclipse, which account should be used: the server's root account, the server's hdfs account, or the local login user? The user name must match the user that runs Hadoop inside the VM. I installed and ran Hadoop 2.6.0 as the hadoop user, so I entered hadoop here; if you installed as root, change it to root accordingly.
D: When executing, you are asked to switch with su hsfs, but no such account exists. That is a typo in the original article; it should be: su hdfs.
# upload from the local machine to the server
scp /home/pandy/workspace/HadoopApp/words_01.txt root@192.168.0.31:/tmp/words_01.txt
# on the server, copy the file into the hdfs filesystem
su hdfs
hadoop fs -mkdir /tmp/input
hadoop fs -mkdir /tmp/output
hadoop fs -put /tmp/words_01.txt /tmp/input/words_01.txt
cd /usr/hdp/2.3.0.0-2557/hadoop-mapreduce/
# run
hadoop jar hadoop-mapreduce-examples-2.7.1.2.3.0.0-2557.jar wordcount /tmp/input/words_01.txt /tmp/output/1007_01

# ================================= output
WARNING: Use "yarn jar" to launch YARN applications.
16/11/01 11:07:32 INFO impl.TimelineClientImpl: Timeline service address: http://h31:8188/ws/v1/timeline/
16/11/01 11:07:32 INFO client.RMProxy: Connecting to ResourceManager at h32/192.168.0.32:8050
16/11/01 11:07:45 INFO input.FileInputFormat: Total input paths to process : 1
16/11/01 11:07:46 INFO mapreduce.JobSubmitter: number of splits:1
16/11/01 11:07:47 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1477967992073_0001
16/11/01 11:07:48 INFO impl.YarnClientImpl: Submitted application application_1477967992073_0001
16/11/01 11:07:48 INFO mapreduce.Job: The url to track the job: http://h32:8088/proxy/application_1477967992073_0001/
16/11/01 11:07:48 INFO mapreduce.Job: Running job: job_1477967992073_0001
16/11/01 11:08:23 INFO mapreduce.Job: Job job_1477967992073_0001 running in uber mode : false
16/11/01 11:08:23 INFO mapreduce.Job:  map 0% reduce 0%
16/11/01 11:08:45 INFO mapreduce.Job:  map 100% reduce 0%
16/11/01 11:09:08 INFO mapreduce.Job:  map 100% reduce 100%
16/11/01 11:09:10 INFO mapreduce.Job: Job job_1477967992073_0001 completed successfully

# view the result on the server
hadoop fs -cat /tmp/output/1007_01/part-r-00000
# ==================== output
After the job finishes, you can inspect the results.
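A quick way to check that the job really produced output (a sketch):

hadoop fs -ls /tmp/output/1007_01
# an empty _SUCCESS marker plus part-r-00000 means the job completed and wrote its results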
E: View the job history at: http://192.168.0.32:19888/jobhistory
2. Example:
package com.first;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountEx {
    public static void main(String[] args) throws Exception {
        // configuration
        Configuration conf = new Configuration();
        // job name
        Job job = Job.getInstance(conf, "mywordcount");
        job.setJarByClass(WordCountEx.class);
        job.setMapperClass(MyMapper.class);
        // job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    /**
     * A small DIY tweak: words shorter than 5 letters are excluded,
     * to make the program easier to compare and understand.
     * @author pandy
     */
    static class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(Object key, Text value, Mapper<Object, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // split the line into tokens
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                // skip words with fewer than 5 letters
                String tmp = itr.nextToken();
                if (tmp.length() < 5)
                    continue;
                word.set(tmp);
                context.write(word, one);
            }
        }
    }

    /**
     * The map counts are multiplied by 2, and a prefix is added to the output key.
     * @author pandy
     */
    static class MyReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        private Text keyEx = new Text();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                // scale the map count by 2
                sum += val.get() * 2;
            }
            result.set(sum);
            // customize the output key
            keyEx.set("输出:" + key.toString());
            context.write(keyEx, result);
        }
    }
}
Package it as HadoopTest.jar and upload it to /var/tmp on server 31.
# upload from the dev machine
scp HadoopTest.jar root@192.168.0.31:/var/tmp
# on server 31
cd /var/tmp
# run; the command in the original article was wrong, this is the corrected version
yarn jar HadoopTest.jar com.first.WordCountEx /tmp/input/words_01.txt /tmp/output/1007_03
# output
16/11/01 11:56:49 INFO impl.TimelineClientImpl: Timeline service address: http://h31:8188/ws/v1/timeline/
16/11/01 11:56:50 INFO client.RMProxy: Connecting to ResourceManager at h32/192.168.0.32:8050
16/11/01 11:56:52 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/11/01 11:56:53 INFO input.FileInputFormat: Total input paths to process : 1
16/11/01 11:56:54 INFO mapreduce.JobSubmitter: number of splits:1
16/11/01 11:56:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1477967992073_0003
16/11/01 11:56:55 INFO impl.YarnClientImpl: Submitted application application_1477967992073_0003
16/11/01 11:56:56 INFO mapreduce.Job: The url to track the job: http://h32:8088/proxy/application_1477967992073_0003/
16/11/01 11:56:56 INFO mapreduce.Job: Running job: job_1477967992073_0003
16/11/01 11:57:16 INFO mapreduce.Job: Job job_1477967992073_0003 running in uber mode : false
16/11/01 11:57:16 INFO mapreduce.Job:  map 0% reduce 0%
......
16/11/01 11:57:43 INFO mapreduce.Job:  map 100% reduce 0%
16/11/01 11:57:57 INFO mapreduce.Job:  map 100% reduce 100%
16/11/01 11:57:58 INFO mapreduce.Job: Job job_1477967992073_0003 completed successfully
3: Test Hive
# upload the file to the server
scp /home/pandy/workspace/HadoopApp/hive.txt root@192.168.0.31:/tmp/hive.txt
hadoop fs -put /tmp/hive.txt /tmp/input/hive.txt

# log in to server H2
su hdfs
hive    # enter the hive shell

-- create the table; if the field and row delimiters are not set, most columns
-- come back NULL after the import, including numeric types like bigint and int
create table t(
    id bigint,
    name string,
    idx int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

show tables;    -- check that the table was created

-- import
load data inpath '/tmp/input/hive.txt' overwrite into table t;
-- query
select * from t;
-- ===================== output
1       eZkvtxiv        8348
2       lHfcqlIy        4119
3       OfdLeKDe        4681
4       tUvGFNjp        452
5       qOTRePCg        79
6       AxFASxiz        4178
7       WoXVlNSj        1034
....
....

-- export to the local filesystem
insert overwrite local directory '/root/dev/t' select * from t;
-- prints: Copying data to local directory /root/dev/t
-- but why is there no file at that path?

-- export to hdfs
insert overwrite directory '/tmp/output/t' select * from t;
-- check
hadoop fs -ls -R /tmp/output
-- shows:
drwxr-xr-x   - hdfs hdfs      0 2016-11-03 10:05 /tmp/output/t
-rwxr-xr-x   3 hdfs hdfs  17802 2016-11-03 10:05 /tmp/output/t/000000_0
-- which means success

-- view the content; the fields look glued together, which seems odd
-- (the default Hive field delimiter is the invisible \001 control character)
hadoop fs -cat /tmp/output/t/000000_0
1eZkvtxiv8348
2lHfcqlIy4119
3OfdLeKDe4681
4tUvGFNjp452
5qOTRePCg79

-- set an explicit export delimiter
insert overwrite directory '/tmp/output/t' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select * from t;
-- check the export
hadoop fs -cat /tmp/output/t/000000_0
1,eZkvtxiv,8348
2,lHfcqlIy,4119
3,OfdLeKDe,4681
4: Access via JDBC
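A minimal way to test JDBC access is beeline, which connects to HiveServer2 over JDBC; a sketch, assuming HiveServer2 ended up on H2 in the Assign Masters step and listens on the default port 10000 (adjust host, port, and user to your cluster). From Java, the same URL works with the org.apache.hive.jdbc.HiveDriver driver from the hive-jdbc artifact.

beeline -u "jdbc:hive2://H2:10000/default" -n hdfs
# at the beeline prompt, the same SQL as in section 3 works, e.g.:
#   select * from t limit 3;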
Problems:
A: The hive command fails during initialization with: SessionNotRunning: TezSession has already shutdown. Fix: https://discuss.pivotal.io/hc/en-us/articles/221806047-Hive-Cli-fails-to-start-with-the-message-TezSession-has-already-shutdown-
B: Hive cannot load data from the local filesystem; it throws an error. Left unresolved for now.
C: After putting the file into hdfs and loading it into hive (tab-delimited), every column in the table shows NULL. Why? See the hint here: http://jingyan.baidu.com/article/624e7459b705f734e8ba5a1d.html and Hive data import/export: http://blog.csdn.net/longshenlmj/article/details/41519503
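For problem C, it helps to check the raw delimiters before blaming the table definition; a sketch:

hadoop fs -cat /tmp/input/hive.txt | head -3 | cat -A
# GNU cat -A renders tabs as ^I; fields separated by ^I match FIELDS TERMINATED BY '\t'
# in the DDL from section 3, anything else would explain the NULL columns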