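This post summarizes how to operate HDFS through its Java API from a Windows machine.

HDFS Java API operations

Configuring the Hadoop environment on Windows

A Hadoop runtime environment must be configured on Windows, so that the Windows machine acts as a Hadoop client.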

Running the code without this configuration produces the following problems:

Could not locate executable null\bin\winutils.exe in the hadoop binaries

Cause: winutils.exe is missing.

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Cause: hadoop.dll is missing.

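Solution

1. Download the Hadoop tool package for Windows from https://github.com/steveloughran/winutils .

2. Extract it to a directory whose path contains no Chinese characters and no spaces, then configure the environment variables: set HADOOP_HOME to the extracted directory that contains bin, and add the following entry to Path:

%HADOOP_HOME%\bin

3. Copy hadoop.dll from the downloaded package into C:\Windows\System32.

4. Restart the computer to finish.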
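The setup above can be sanity-checked before any HDFS code runs. The following is a minimal sketch that uses only the JDK; the class HadoopEnvCheck and its methods are hypothetical helpers, not part of Hadoop, and it assumes that winutils.exe and hadoop.dll both sit under the bin directory of the extracted package.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Hadoop): checks the directory layout described above.
class HadoopEnvCheck {

    // Returns the expected location of a file under <hadoopHome>/bin.
    static Path binFile(String hadoopHome, String fileName) {
        return Paths.get(hadoopHome, "bin", fileName);
    }

    // True only if hadoopHome is set and both winutils.exe and hadoop.dll exist under bin.
    static boolean isConfigured(String hadoopHome) {
        if (hadoopHome == null || hadoopHome.isEmpty()) {
            return false;
        }
        return Files.isRegularFile(binFile(hadoopHome, "winutils.exe"))
                && Files.isRegularFile(binFile(hadoopHome, "hadoop.dll"));
    }

    public static void main(String[] args) {
        String hadoopHome = System.getenv("HADOOP_HOME");
        System.out.println("HADOOP_HOME = " + hadoopHome);
        System.out.println("configured  = " + isConfigured(hadoopHome));
    }
}
```

If the check reports false after the environment variables were edited, the usual cause is a process that was started before the change and still sees the old environment.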

Importing the Maven dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.7</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.7</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.7</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.7</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

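Accessing data through a URL (for reference only)

@Test
public void urlHDFS() throws IOException {
    // Register the HDFS URL stream handler; the JVM allows this call only once per process
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    // Open an input stream on the HDFS file
    InputStream inputStream = new URL("hdfs://bigdata1:8020/a.txt").openStream();
    // Open an output stream on the local file
    FileOutputStream fileOutputStream = new FileOutputStream(new File("D:\\hello.txt"));
    // Copy the file contents (IOUtils here is presumably org.apache.commons.io.IOUtils)
    IOUtils.copy(inputStream, fileOutputStream);
    // Close the streams
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(fileOutputStream);
}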

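Accessing data through the FileSystem API (essential)

Main classes involved

Operating on HDFS from Java mainly involves the following classes:

Configuration: an object of this class encapsulates the client or server configuration.
FileSystem: an object of this class represents a file system, and its methods operate on files. It is obtained through the static method get of FileSystem:
FileSystem fs = FileSystem.get(conf); // conf is a Configuration object
- get reads the fs.defaultFS parameter in conf to decide which type of file system to return.
- If the code does not set fs.defaultFS and no configuration file on the project classpath provides it, the value in conf comes from core-default.xml inside the Hadoop jar.
- That default is file:///, so the returned object is a client for the local file system rather than a DistributedFileSystem instance.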
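The scheme-based selection described above can be illustrated without any Hadoop classes. This is a toy sketch, not Hadoop code: the class DefaultFsScheme and its methods are hypothetical, and the mapping covers only the two cases the text mentions.

```java
import java.net.URI;

// Toy illustration (not Hadoop code): which kind of client a fs.defaultFS value selects.
class DefaultFsScheme {

    // Value of fs.defaultFS in core-default.xml, as stated in the text above.
    static final String CORE_DEFAULT = "file:///";

    // Returns the URI scheme of the configured value, or of the default when nothing is configured.
    static String schemeOf(String configuredDefaultFs) {
        String value = (configuredDefaultFs == null) ? CORE_DEFAULT : configuredDefaultFs;
        return URI.create(value).getScheme();
    }

    // Names the kind of client the text says FileSystem.get returns for that scheme.
    static String describe(String configuredDefaultFs) {
        String scheme = schemeOf(configuredDefaultFs);
        if (scheme == null) {
            return "other (no scheme)";
        }
        switch (scheme) {
            case "hdfs": return "DistributedFileSystem";
            case "file": return "local file system";
            default:     return "other (" + scheme + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(null));                   // nothing configured
        System.out.println(describe("hdfs://bigdata1:8020")); // value used in this post
    }
}
```

This is why a test that forgets to set fs.defaultFS can appear to work while actually reading and writing the local disk.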

Four ways to obtain a FileSystem

Method 1

@Test
public void getFileSystem1() throws IOException {
    // Create a Configuration object to hold the settings
    Configuration configuration = new Configuration();
    // Set the file system type
    configuration.set("fs.defaultFS", "hdfs://bigdata1:8020");
    // Obtain the specified file system
    FileSystem fileSystem = FileSystem.get(configuration);
    // Prints DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_373282607_1, ugi=11655 (auth:SIMPLE)]]
    System.out.println(fileSystem);
}

Method 2

@Test
public void getFileSystem2() throws IOException, URISyntaxException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration());
    System.out.println(fileSystem);
}

Method 3

@Test
public void getFileSystem3() throws IOException {
    // Create a Configuration object
    Configuration configuration = new Configuration();
    // Set the file system type
    configuration.set("fs.defaultFS", "hdfs://bigdata1:8020");
    // Obtain a new instance of the specified file system
    FileSystem fileSystem = FileSystem.newInstance(configuration);
    // Prints DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_373282607_1, ugi=11655 (auth:SIMPLE)]]
    System.out.println(fileSystem);
}

Method 4

@Test
public void getFileSystem4() throws IOException, URISyntaxException {
    FileSystem fileSystem = FileSystem.newInstance(new URI("hdfs://bigdata1:8020"), new Configuration());
    System.out.println(fileSystem);
}

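Note: methods 1 and 3 are similar, as are methods 2 and 4; the difference lies in get versus newInstance. get returns a cached instance shared by every caller that uses the same URI scheme, authority, and user, whereas newInstance always creates a separate one.

How a Configuration object loads its parameters:

First, construction loads the default configuration shipped in the jars, such as xxx-default.xml.
Next, it loads user-supplied configuration files (they must be placed in the resources directory so that they end up on the classpath), such as a custom hdfs-site.xml.
Finally, values can be set by hand, overriding any earlier setting of the same key: configuration.set("dfs.blocksize", "64m");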
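The practical difference between get and newInstance is sharing: FileSystem.get hands out a cached instance, so closing it in one place affects every other holder. The sketch below is a toy model of that behavior using only the JDK. It is not Hadoop's implementation; FsCacheModel and Client are hypothetical names, and the cache key is simplified to scheme plus authority (Hadoop's real key also takes the user into account).

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy model (not Hadoop's implementation) of cached get() versus uncached newInstance().
class FsCacheModel {

    // Stand-in for a file system client; it only tracks whether it has been closed.
    static final class Client {
        final String key;
        boolean closed = false;

        Client(String key) {
            this.key = key;
        }

        void close() {
            closed = true;
        }
    }

    private final Map<String, Client> cache = new HashMap<>();

    // Simplified cache key: scheme plus authority; any path in the URI is ignored.
    static String keyOf(URI uri) {
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    // Like FileSystem.get: returns the shared instance for this key, creating it only once.
    Client get(URI uri) {
        return cache.computeIfAbsent(keyOf(uri), Client::new);
    }

    // Like FileSystem.newInstance: always returns a fresh, unshared instance.
    Client newInstance(URI uri) {
        return new Client(keyOf(uri));
    }

    public static void main(String[] args) {
        FsCacheModel model = new FsCacheModel();
        URI uri = URI.create("hdfs://bigdata1:8020");
        Client a = model.get(uri);
        Client b = model.get(uri);
        Client c = model.newInstance(uri);
        a.close();
        System.out.println("a and b are the same object: " + (a == b));
        System.out.println("b closed after a.close():    " + b.closed);
        System.out.println("c closed after a.close():    " + c.closed);
    }
}
```

In code where several components each call close on their own handle, newInstance is usually the safer choice, at the cost of one extra client per caller.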
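The three-layer precedence above can be modeled with plain java.util.Properties. This is only a sketch of the override order, not Hadoop's Configuration class; LayeredConfig is a hypothetical name, and the values in main are illustrative rather than read from any real file.

```java
import java.util.Properties;

// Toy model (plain java.util.Properties, not Hadoop's Configuration) of the three layers above.
class LayeredConfig {

    // Builds the effective settings: defaults first, then site files, then programmatic overrides.
    static Properties resolve(Properties defaults, Properties site, Properties overrides) {
        Properties effective = new Properties();
        effective.putAll(defaults);   // 1. xxx-default.xml shipped in the jars
        effective.putAll(site);       // 2. user files on the classpath, e.g. hdfs-site.xml
        effective.putAll(overrides);  // 3. configuration.set(...) calls in code
        return effective;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("dfs.blocksize", "134217728"); // illustrative default
        defaults.setProperty("dfs.replication", "3");       // illustrative default

        Properties site = new Properties();
        site.setProperty("dfs.replication", "2");           // illustrative site override

        Properties overrides = new Properties();
        overrides.setProperty("dfs.blocksize", "64m");      // the manual set(...) from the text

        Properties effective = resolve(defaults, site, overrides);
        System.out.println("dfs.blocksize   = " + effective.getProperty("dfs.blocksize"));
        System.out.println("dfs.replication = " + effective.getProperty("dfs.replication"));
    }
}
```

A later layer replaces only the keys it defines; every other key keeps the value from the layer below it.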

Listing all files in HDFS

@Test
public void listFiles() throws URISyntaxException, IOException {
    // Obtain a FileSystem instance
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration());
    // listFiles returns an iterator over the files under a directory
    // First argument: the directory to list
    // Second argument: whether to descend into subdirectories recursively
    RemoteIterator<LocatedFileStatus> iterator = fileSystem.listFiles(new Path("/"), true);
    // Walk the iterator and print the details of each file
    while (iterator.hasNext()){
        LocatedFileStatus fileStatus = iterator.next();
        // Full path of the file ("hdfs://bigdata1:8020/xxx"), followed by its name
        System.out.println(fileStatus.getPath() + "  ---  " + fileStatus.getPath().getName());

        // Block information of the file
        BlockLocation[] blockLocations = fileStatus.getBlockLocations();
        System.out.println("Block count: " + blockLocations.length);
    }
}

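Creating directories in HDFS

@Test
public void mkdirs() throws URISyntaxException, IOException {
    // Only the scheme and authority of the URI select the file system, so no file path is needed here
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration());
    // Create a directory
    boolean bl = fileSystem.mkdirs(new Path("/aaa/bbb/ccc"));
    // Create an empty file; close the returned output stream so the file is finalized
    fileSystem.create(new Path("/aaa/aaa.txt")).close();
    // Both methods create any missing parent directories
    System.out.println(bl);
    fileSystem.close();
}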

Downloading a file

@Test
public void downloadFile() throws URISyntaxException, IOException {
    // Obtain the FileSystem
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration());

    // Open an input stream on the HDFS file
    FSDataInputStream inputStream = fileSystem.open(new Path("/a.txt"));
    // Open an output stream on the local path
    FileOutputStream outputStream = new FileOutputStream("D:/a.txt");
    // Copy the file contents
    IOUtils.copy(inputStream, outputStream);
    // Close the streams
    IOUtils.closeQuietly(inputStream);
    IOUtils.closeQuietly(outputStream);
    fileSystem.close();
}
/*
 * Downloading a file, second approach
 * */
@Test
public void downloadFile2() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration());
    fileSystem.copyToLocalFile(new Path("/a.txt"), new Path("D:/a.txt"));
    fileSystem.close();
}

Uploading a file

@Test
public void uploadFile() throws URISyntaxException, IOException {
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration());
    // Copy the local file D:/b.txt into the HDFS root directory
    fileSystem.copyFromLocalFile(new Path("D:/b.txt"), new Path("/"));
    fileSystem.close();
}

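HDFS permission control

Open etc/hadoop/hdfs-site.xml under the Hadoop installation directory and set dfs.permissions.enabled to true to turn on permission checking. Permission changes made from the command line are enforced only while checking is enabled, and the configuration change itself takes effect only after HDFS is restarted.

hdfs dfs -chmod 000 /a.txt

The three digits are the octal permission bits for the owner, the group, and other users. With permission checking enabled, every file has an owner, and a user who lacks the required permission cannot access the resource. In that case the user to act as can be passed to get (this overload also throws InterruptedException):

FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration(), "root");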
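To make the numeric mode passed to chmod concrete, the sketch below converts a three-digit octal mode into the familiar rwx form. It is a stand-alone illustration with a hypothetical class name, ModeString, and does not call any Hadoop API.

```java
// Hypothetical helper: renders an octal mode such as 644 as rwx triplets.
class ModeString {

    // Converts one octal digit (0-7) to its rwx form, for example 6 -> "rw-".
    static String triplet(int digit) {
        return ((digit & 4) != 0 ? "r" : "-")
             + ((digit & 2) != 0 ? "w" : "-")
             + ((digit & 1) != 0 ? "x" : "-");
    }

    // Converts a three-digit octal string to the owner/group/other rwx form.
    static String toSymbolic(String octal) {
        if (octal == null || !octal.matches("[0-7]{3}")) {
            throw new IllegalArgumentException("expected three octal digits: " + octal);
        }
        StringBuilder sb = new StringBuilder();
        for (char c : octal.toCharArray()) {
            sb.append(triplet(c - '0'));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("000 -> " + toSymbolic("000")); // mode used in the chmod example
        System.out.println("644 -> " + toSymbolic("644"));
    }
}
```

Mode 000 therefore clears the read, write, and execute bits for the owner, the group, and everyone else.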

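Merging small files

Hadoop is best at storing large files, because a large file needs comparatively little metadata. A cluster holding a large number of small files has to maintain a large amount of metadata, which increases memory pressure on the NameNode. It is therefore worth merging small files into large ones before processing.

From the HDFS shell, the following commands merge many HDFS files into one large file and download it to the local machine:

cd /export/servers
hdfs dfs -getmerge /config/*.xml ./hello.xml  # merge the files and download the result to hello.xml in the current directory

Small files can likewise be merged into one large file while uploading: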
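A rough count shows why small files are expensive. The sketch below assumes that the metadata holds one object per file plus one per block, and it treats the per-object memory cost as a caller-supplied parameter rather than a known constant. The class NameNodeMemoryEstimate, the 128 MiB block size, and the file sizes in main are all illustrative assumptions.

```java
// Back-of-the-envelope model; bytesPerObject is an assumed input, not a measured value.
class NameNodeMemoryEstimate {

    // Number of blocks needed to store one file of the given size (ceiling division).
    static long blocksFor(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        return (fileSize + blockSize - 1) / blockSize;
    }

    // Metadata objects tracked for fileCount files of equal size: one per file plus one per block.
    static long objectsFor(long fileCount, long fileSize, long blockSize) {
        return fileCount * (1 + blocksFor(fileSize, blockSize));
    }

    // Estimated memory for the given number of objects at an assumed cost per object.
    static long estimateBytes(long objects, long bytesPerObject) {
        return objects * bytesPerObject;
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20; // 128 MiB, an assumed block size
        long oneGiB = 1L << 30;
        long oneMiB = 1L << 20;
        // The same 1 GiB of data stored as one large file or as 1024 small files.
        System.out.println("one 1 GiB file    : " + objectsFor(1, oneGiB, blockSize) + " objects");
        System.out.println("1024 files, 1 MiB : " + objectsFor(1024, oneMiB, blockSize) + " objects");
    }
}
```

Under these assumptions the same amount of data needs 9 metadata objects as one file and 2048 as small files, which is the pressure that merging relieves.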

@Test
public void mergeFile() throws URISyntaxException, IOException, InterruptedException {
    // Obtain the FileSystem, acting as user root
    FileSystem fileSystem = FileSystem.get(new URI("hdfs://bigdata1:8020"), new Configuration(), "root");
    // Open an output stream on HDFS, creating the large file that will hold all the content
    FSDataOutputStream outputStream = fileSystem.create(new Path("/big.txt"));
    // Obtain the local file system
    LocalFileSystem localFileSystem = FileSystem.getLocal(new Configuration());
    // List the files in the local directory; input was prepared in advance and contains some small files
    FileStatus[] fileStatuses = localFileSystem.listStatus(new Path("D:\\input"));
    // Iterate over the files and open an input stream on each one
    for (FileStatus fileStatus : fileStatuses) {
        FSDataInputStream inputStream = localFileSystem.open(fileStatus.getPath());
        // Append the small file's data to the large file
        IOUtils.copy(inputStream, outputStream);
        IOUtils.closeQuietly(inputStream);
    }
    // Close the streams
    IOUtils.closeQuietly(outputStream);
    localFileSystem.close();
    fileSystem.close();
}
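The same merge loop can be tried on a local disk without a cluster. The following is a JDK-only analogue of the method above, not Hadoop code: LocalMerge is a hypothetical class, the target is any OutputStream, and the files are appended in file-name order so that the result is deterministic (an ordering choice added here, not something the original method guarantees).

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Local-disk analogue of the merge above, using only the JDK (no Hadoop classes).
class LocalMerge {

    // Appends every regular file in dir to out, in file-name order; returns the bytes written.
    static long mergeInto(Path dir, OutputStream out) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) { // skip subdirectories
                    files.add(p);
                }
            }
        }
        Collections.sort(files); // deterministic order
        long total = 0;
        for (Path p : files) {
            total += Files.copy(p, out); // streams the whole file into out
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LocalMerge <inputDir> <outputFile>
        if (args.length != 2) {
            System.out.println("usage: LocalMerge <inputDir> <outputFile>");
            return;
        }
        try (OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            System.out.println("bytes merged: " + mergeInto(Paths.get(args[0]), out));
        }
    }
}
```

Skipping non-regular entries matters: a loop that opens every directory entry fails as soon as the input folder contains a subdirectory.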

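HDFS high availability

Before Hadoop 2.x, the NameNode was a single point of failure in an HDFS cluster: each cluster had only one NameNode, and once that node became unavailable the entire cluster was unavailable.
The HDFS high availability (HA) scheme was created to solve this problem. An HA HDFS cluster runs two NameNodes at the same time, one as the Active NameNode and one as the Standby NameNode. The Standby's namespace is kept synchronized with the Active's in real time, so when the Active NameNode fails and stops serving, the Standby can switch to the active state immediately without affecting the HDFS service.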
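The Active/Standby idea can be pictured as a tiny state machine. The sketch below is a toy model for intuition only, not Hadoop's HA implementation: NameNodePair, the node names nn1 and nn2, and the failActive method are hypothetical, and namespace synchronization is assumed rather than modeled.

```java
// Toy model of the Active/Standby pair described above; not Hadoop's HA implementation.
class NameNodePair {

    enum State { ACTIVE, STANDBY, DOWN }

    private State nn1 = State.ACTIVE;
    private State nn2 = State.STANDBY;

    // Simulates a failure of the active node and promotes the standby if one exists.
    void failActive() {
        if (nn1 == State.ACTIVE) {
            nn1 = State.DOWN;
            if (nn2 == State.STANDBY) {
                nn2 = State.ACTIVE;
            }
        } else if (nn2 == State.ACTIVE) {
            nn2 = State.DOWN;
            if (nn1 == State.STANDBY) {
                nn1 = State.ACTIVE;
            }
        }
    }

    // Name of the node currently serving clients, or null if no node is active.
    String activeNode() {
        if (nn1 == State.ACTIVE) {
            return "nn1";
        }
        if (nn2 == State.ACTIVE) {
            return "nn2";
        }
        return null;
    }

    // The cluster is available as long as some node is active.
    boolean isAvailable() {
        return activeNode() != null;
    }

    public static void main(String[] args) {
        NameNodePair pair = new NameNodePair();
        System.out.println("active: " + pair.activeNode());
        pair.failActive();
        System.out.println("after one failure, active: " + pair.activeNode());
        pair.failActive();
        System.out.println("after two failures, available: " + pair.isAvailable());
    }
}
```

With a single NameNode the first failure already makes the cluster unavailable; with the pair, service is lost only when both nodes are down.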

Details: https://blog.csdn.net/u012736748/article/details/79534019

Big Data 04 (HDFS API operations)