赞
踩
实验环境:
docker pull jupyter/pyspark-notebook
docker run --name pyspark --rm -v D:\spark:/home/jovyan -p 8888:8888 jupyter/pyspark-notebook

在浏览器打开网址即可
相关教程
User Guide — PySpark 3.2.1 documentation (apache.org)
参考教程:
GitHub - sshaik/bitnami-docker-spark: Bitnami Docker Image for Apache Spark
使用 Docker 快速部署 Spark + Hadoop 大数据集群 - 知乎 (zhihu.com)
使用 Docker 快速部署 Spark + Hadoop 大数据集群 | 芥子屋 (s1mple.cc)
docker pull bitnami/spark:3
PS C:\Windows\system32> docker-compose -v
Docker Compose version v2.17.2
在本地新建一个目录创建docker-compose.yml (任意目录均可)

docker-compose.yml
version: '2' services: spark: image: docker.io/bitnami/spark:3 hostname: master environment: - SPARK_MODE=master - SPARK_RPC_AUTHENTICATION_ENABLED=no - SPARK_RPC_ENCRYPTION_ENABLED=no - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no - SPARK_SSL_ENABLED=no volumes: - D:/Docker/spark/share:/opt/share ports: - '8080:8080' - '4040:4040' spark-worker-1: image: docker.io/bitnami/spark:3 hostname: worker1 environment: - SPARK_MODE=worker - SPARK_MASTER_URL=spark://master:7077 - SPARK_WORKER_MEMORY=1G - SPARK_WORKER_CORES=1 - SPARK_RPC_AUTHENTICATION_ENABLED=no - SPARK_RPC_ENCRYPTION_ENABLED=no - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no - SPARK_SSL_ENABLED=no volumes: - D:/Docker/spark/share:/opt/share ports: - '8081:8081' spark-worker-2: image: docker.io/bitnami/spark:3 hostname: worker2 environment: - SPARK_MODE=worker - SPARK_MASTER_URL=spark://master:7077 - SPARK_WORKER_MEMORY=1G - SPARK_WORKER_CORES=1 - SPARK_RPC_AUTHENTICATION_ENABLED=no - SPARK_RPC_ENCRYPTION_ENABLED=no - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no - SPARK_SSL_ENABLED=no volumes: - D:/Docker/spark/share:/opt/share ports: - '8082:8081'
version :compose版本号
hostname:容器实例主机名;
volumes:挂载本地目录 D:/Docker/spark/share 到容器目录 /opt/share;
ports:开放 4040 和从节点 Spark Web UI 端口
返回PowerShell界面,进入docker-compose.yml所在的目录
cd D:\Docker\spark
PS D:\Docker\spark> docker-compose up -d
[+] Running 3/3
✔ Container spark-spark-1 Started 1.9s
✔ Container spark-spark-worker-2-1 Started 2.4s
✔ Container spark-spark-worker-1-1 Started

在浏览器打开 http://localhost:8080/

PS D:\Docker\spark> docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b905de0c118b bitnami/spark:3 "/opt/bitnami/script…" 2 hours ago Up 12 minutes 0.0.0.0:8081->8081/tcp spark-spark-worker-1-1
33be43384bf3 bitnami/spark:3 "/opt/bitnami/script…" 2 hours ago Up 12 minutes 0.0.0.0:4040->4040/tcp, 0.0.0.0:8080->8080/tcp spark-spark-1
47c4bf45f4ca bitnami/spark:3 "/opt/bitnami/script…" 2 hours ago Up 12 minutes 0.0.0.0:8082->8081/tcp spark-spark-worker-2-1
进入到 master 容器内部
PS D:\Docker\spark> docker exec -it 33be43384bf3 /bin/bash
pyspark

查找对应hadoop安装包的下载路径,例如:
https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/hadoop-3.3.5.tar.gz
GitHub - s1mplecc/spark-hadoop-docker
新建一个config目录,创建相应的文件(从上述链接下载)

查看config
PS D:\Docker\spark> tree /F config
编写 Hadoop 启动脚本。
由于设置了 ssh 免密通信,首先需要启动 ssh 服务,然后依次启动 HDFS 和 YARN 集群。


#!/bin/bash
service ssh start
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
在工作目录下,创建用于构建新镜像的 Dockerfile
$HADOOP_CONF_DIR 目录下的 Hadoop 配置文件;FROM docker.io/bitnami/spark:3 LABEL maintainer="s1mplecc <s1mple951205@gmail.com>" LABEL description="Docker image with Spark (3.1.2) and Hadoop (3.2.0), based on bitnami/spark:3. \ For more information, please visit https://github.com/s1mplecc/spark-hadoop-docker." USER root ENV HADOOP_HOME="/opt/hadoop" ENV HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop" ENV HADOOP_LOG_DIR="/var/log/hadoop" ENV PATH="$HADOOP_HOME/hadoop/sbin:$HADOOP_HOME/bin:$PATH" WORKDIR /opt RUN apt-get update && apt-get install -y openssh-server RUN ssh-keygen -t rsa -f /root/.ssh/id_rsa -P '' && \ cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys RUN curl -OL https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/stable/hadoop-3.3.5.tar.gz RUN tar -xzvf hadoop-3.3.5.tar.gz && \ mv hadoop-3.3.5 hadoop && \ rm -rf hadoop-3.3.5.tar.gz && \ mkdir /var/log/hadoop RUN mkdir -p /root/hdfs/namenode && \ mkdir -p /root/hdfs/datanode COPY config/* /tmp/ RUN mv /tmp/ssh_config /root/.ssh/config && \ mv /tmp/hadoop-env.sh $HADOOP_CONF_DIR/hadoop-env.sh && \ mv /tmp/hdfs-site.xml $HADOOP_CONF_DIR/hdfs-site.xml && \ mv /tmp/core-site.xml $HADOOP_CONF_DIR/core-site.xml && \ mv /tmp/mapred-site.xml $HADOOP_CONF_DIR/mapred-site.xml && \ mv /tmp/yarn-site.xml $HADOOP_CONF_DIR/yarn-site.xml && \ mv /tmp/workers $HADOOP_CONF_DIR/workers COPY start-hadoop.sh /opt/start-hadoop.sh RUN chmod +x /opt/start-hadoop.sh && \ chmod +x $HADOOP_HOME/sbin/start-dfs.sh && \ chmod +x $HADOOP_HOME/sbin/start-yarn.sh RUN hdfs namenode -format RUN sed -i "1 a /etc/init.d/ssh start > /dev/null &" /opt/bitnami/scripts/spark/entrypoint.sh ENTRYPOINT [ "/opt/bitnami/scripts/spark/entrypoint.sh" ] CMD [ "/opt/bitnami/scripts/spark/run.sh" ]
在工作目录下,执行如下命令构建镜像:
PS D:\Docker\spark> docker build -t s1mplecc/spark-hadoop:3 .
构建镜像完成后,还需要修改 docker-compose.yml 文件,使其从新的镜像 s1mplecc/spark-hadoop:3中启动容器集群,同时映射 Hadoop Web UI 端口。
拉取镜像
docker pull s1mplecc/spark-hadoop:3
修改docker-compose.yml
version: '2' services: spark: image: s1mplecc/spark-hadoop:3 hostname: master environment: - SPARK_MODE=master - SPARK_RPC_AUTHENTICATION_ENABLED=no - SPARK_RPC_ENCRYPTION_ENABLED=no - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no - SPARK_SSL_ENABLED=no volumes: - D:/Docker/spark/share:/opt/share ports: - '8080:8080' - '4040:4040' - '8088:8088' - '8042:8042' - '9870:9870' - '19888:19888' spark-worker-1: image: s1mplecc/spark-hadoop:3 hostname: worker1 environment: - SPARK_MODE=worker - SPARK_MASTER_URL=spark://master:7077 - SPARK_WORKER_MEMORY=1G - SPARK_WORKER_CORES=1 - SPARK_RPC_AUTHENTICATION_ENABLED=no - SPARK_RPC_ENCRYPTION_ENABLED=no - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no - SPARK_SSL_ENABLED=no volumes: - D:/Docker/spark/share:/opt/share ports: - '8081:8081' spark-worker-2: image: s1mplecc/spark-hadoop:3 hostname: worker2 environment: - SPARK_MODE=worker - SPARK_MASTER_URL=spark://master:7077 - SPARK_WORKER_MEMORY=1G - SPARK_WORKER_CORES=1 - SPARK_RPC_AUTHENTICATION_ENABLED=no - SPARK_RPC_ENCRYPTION_ENABLED=no - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no - SPARK_SSL_ENABLED=no volumes: - D:/Docker/spark/share:/opt/share ports: - '8082:8081'
运行 docker-compose 启动命令重建集群,不需要停止或删除旧集群。
docker-compose up -d
进入主节点
docker ps
docker exec -it 322da32f44cb bash
./start-hadoop.sh
有警告不影响启动
注:如果测试过程中出现问题,可以考虑把GitHub - s1mplecc/spark-hadoop-docker安装包里的相关内容都复制到相应本地目录下
使用命令将共享文件中的 words.txt 写入 HDFS:
hadoop fs -put /opt/share/words.txt /
hdfs dfs -ls /
有警告不影响
http://localhost:9870

Spark 访问 HDFS
pyspark
>>>lines = sc.textFile('hdfs://master:9000/words.txt')
>>>lines.collect()
>>>words = lines.flatMap(lambda x: x.split(' '))
>>>words.saveAsTextFile('hdfs://master:9000/split-words.txt')
>>>exit()
HDFS 上的文件被读取为 RDD,在内存上进行 Transformation 后写入 HDFS。写入的文件被存储到 HDFS 的 DataNode 块分区上。
hdfs dfs -ls /
hdfs dfs -ls /split-words.txt
hdfs dfs -cat /split-words.txt/part-00000
将 Spark 应用提交到 YARN 集群
spark-submit --master yarn --deploy-mode cluster --name "Word Count" --executor-memory 1g --class org.apache.spark.examples.JavaWordCount /opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.0.jar /word.txt
http://localhost:8088/cluster/apps/

| Web UI | 默认网址 | 备注 |
|---|---|---|
| * Spark Application | http://localhost:4040 | 由 SparkContext 启动,显示以本地或 Standalone 模式运行的 Spark 应用 |
| Spark Standalone Master | http://localhost:8080 | 显示集群状态,以及以 Standalone 模式提交的 Spark 应用 |
| * HDFS NameNode | http://localhost:9870 | 可浏览 HDFS 文件系统 |
| * YARN ResourceManager | http://localhost:8088 | 显示提交到 YARN 上的 Spark 应用 |
| YARN NodeManager | http://localhost:8042 | 显示工作节点配置信息和运行时日志 |
| MapReduce Job History | http://localhost:19888 | MapReduce 历史任务 |
注:星号标注的为较常用的 Web UI。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。