Trino on k8s: Advanced Orchestration and Deployment - 六虎

I. Overview

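Trino on Kubernetes is a solution that combines the Trino query engine with the Kubernetes container orchestration platform, so that Trino can be deployed, managed, and run on a Kubernetes cluster.

Trino (formerly PrestoSQL) is a high-performance distributed SQL query engine designed for large datasets and complex queries. Kubernetes is a popular open-source container orchestration platform that automates the deployment, scaling, and management of containers.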

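Deploying Trino on Kubernetes brings several advantages:

Elastic scaling: Kubernetes can scale containers automatically, adding or removing Trino instances as the workload demands. Scaling with the query load improves performance and resource utilization.
High availability: Kubernetes provides fault tolerance and failure recovery. With multiple Trino instances deployed in the cluster, other instances can take over when one of them fails, keeping the system available.
Resource management: Kubernetes schedules and manages resources, so you can control the compute, storage, and network resources that Trino instances use. Setting suitable resource requests and limits keeps the resource consumption of Trino queries in check and avoids conflicts and contention.
Simpler deployment and management: Kubernetes offers declarative configuration and automated deployment, which simplifies deploying and managing Trino. With the standard Kubernetes tools and APIs, Trino instances are easy to create, configure, and monitor.
Ecosystem integration: Kubernetes has a rich ecosystem and integrates with other tools and platforms. For example, Trino can be combined with storage systems (such as Hadoop HDFS and Amazon S3) and other data processing tools (such as Apache Spark) for seamless data access and processing.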

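Note that running Trino on Kubernetes requires appropriate configuration and tuning to ensure performance and reliability. For large-scale, complex query workloads, you may also need to consider optimizations such as data sharding, data partitioning, and data locality.

In short, Trino on Kubernetes offers a flexible, scalable, and efficient way to deploy and manage the Trino query engine, so that it can better meet query needs in big data environments.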

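This article only covers the deployment process. To learn more about Trino, see these articles of mine:

Big Data Hadoop: Presto, an In-Memory SQL Query Engine (Presto/Trino Environment Deployment)
[Big Data] Presto (Trino) SQL Syntax, Advanced
[Big Data] Presto (Trino) REST API and Execution Plans
[Big Data] Presto (Trino) Configuration Parameters and SQL Syntax

For a single-machine container deployment, see this article of mine: [Big Data] Deploying Presto (Trino) Quickly with docker-compose: A Step-by-Step Tutorial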

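II. k8s Environment Deployment

Setting up the k8s environment is not repeated here; the focus is Trino on k8s. If you are not sure how to deploy a k8s environment, see these articles of mine:

[Cloud Native] Quick k8s Environment Deployment (Done Within an Hour)
[Cloud Native] k8s Offline Deployment: Explanation and Hands-On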

III. Orchestrating and Deploying Trino

1) Building the image: Dockerfile

FROM registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/centos:7.7.1908
RUN rm -f /etc/localtime && ln -sv /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && echo "Asia/Shanghai" > /etc/timezone
# Use ENV so the locale persists in the image (RUN export only affects that one layer's shell)
ENV LANG zh_CN.UTF-8
# Create the user and group; these must match user: 10000:10000 in the YAML manifests
RUN groupadd --system --gid=10000 hadoop && useradd --system --home-dir /home/hadoop --uid=10000 --gid=hadoop hadoop -m
# Install sudo
RUN yum -y install sudo ; chmod 640 /etc/sudoers
# Grant sudo privileges to the hadoop user
RUN echo "hadoop ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
RUN yum -y install net-tools telnet wget nc
RUN mkdir /opt/apache/
# Add and configure the JDK
ADD zulu20.30.11-ca-jdk20.0.1-linux_x64.tar.gz /opt/apache/
ENV JAVA_HOME /opt/apache/zulu20.30.11-ca-jdk20.0.1-linux_x64
ENV PATH $JAVA_HOME/bin:$PATH
# Add and configure the Trino server
ENV TRINO_VERSION 416
ADD trino-server-${TRINO_VERSION}.tar.gz /opt/apache/
ENV TRINO_HOME /opt/apache/trino
RUN ln -s /opt/apache/trino-server-${TRINO_VERSION} $TRINO_HOME
# Create the configuration directory and the catalog directory for data sources
RUN mkdir -p ${TRINO_HOME}/etc/catalog
# Add and configure the Trino CLI
COPY trino-cli-416-executable.jar $TRINO_HOME/bin/trino-cli
# copy bootstrap.sh
COPY bootstrap.sh /opt/apache/
RUN chmod +x /opt/apache/bootstrap.sh ${TRINO_HOME}/bin/trino-cli
RUN chown -R hadoop:hadoop /opt/apache
WORKDIR $TRINO_HOME

Contents of the bootstrap.sh script:

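#!/usr/bin/env sh
# Wait until host $1 accepts TCP connections on port $2; skipped if either is empty.
wait_for() {
    if [ -n "$1" ] && [ -n "$2" ]; then
       echo "Waiting for $1 to listen on $2..."
       while ! nc -z "$1" "$2"; do echo waiting...; sleep 1; done
    fi
}
# $1 = role, $2 = service command.
# $3 and $4 are assumed to be an optional host and port to wait for before starting.
start_trino() {
   wait_for "$3" "$4"
   ${TRINO_HOME}/bin/launcher run --verbose
}
case $1 in
        trino-coordinator)
                start_trino coordinator "$@"
                ;;
        trino-worker)
                start_trino worker "$@"
                ;;
        *)
                echo "Please provide a valid service start command."
                exit 1
        ;;
esac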
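
The wait_for loop in bootstrap.sh polls with no upper bound, so a pod can hang forever if the target never comes up. A bounded variant could look like the sketch below. retry_until and its arguments are hypothetical names introduced here for illustration; they are not part of the script above.

```shell
#!/usr/bin/env sh
# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example (assumes nc is installed, as in the Dockerfile):
# retry_until 60 1 nc -z trino-coordinator 8080
```

Returning a non-zero status lets the container exit and be restarted by Kubernetes instead of waiting silently.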

Build the image:

docker build -t registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s:416 . --no-cache
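### Parameter notes
# -t: sets the image name and tag
# . : uses the current directory (and its Dockerfile) as the build context
# -f: specifies the path to the Dockerfile
# --no-cache: builds without using the layer cache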
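
The Dockerfile ADDs and COPYs several local files, and the build fails partway through if one is missing. A small pre-build check might look like this sketch. check_build_context is a hypothetical helper; the file names are the ones referenced in the Dockerfile above.

```shell
#!/usr/bin/env sh
# check_build_context DIR: report any file the Dockerfile expects but DIR lacks.
# Hypothetical helper; returns 0 only when every file is present.
check_build_context() {
    dir=${1:-.}
    missing=0
    for f in \
        Dockerfile \
        bootstrap.sh \
        zulu20.30.11-ca-jdk20.0.1-linux_x64.tar.gz \
        trino-server-416.tar.gz \
        trino-cli-416-executable.jar
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
# check_build_context . && docker build -t registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s:416 . --no-cache
```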

2) values.yaml configuration

# Default values for trino.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
image:
  repository: registry.cn-hangzhou.aliyuncs.com/bigdata_cloudnative/trino-k8s
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart version.
  tag: 416
imagePullSecrets:
  - name: registry-credentials
server:
  workers: 1
  node:
    environment: production
    dataDir: /opt/apache/trino/data
    pluginDir: /opt/apache/trino/plugin
  log:
    trino:
      level: INFO
  config:
    path: /opt/apache/trino/etc
    http:
      port: 8080
    https:
      enabled: false
      port: 8443
      keystore:
        path: ""
    # Trino supports multiple authentication types: PASSWORD, CERTIFICATE, OAUTH2, JWT, KERBEROS
    # For more info: https://trino.io/docs/current/security/authentication-types.html
    authenticationType: ""
    query:
      maxMemory: "1GB"
      maxMemoryPerNode: "512MB"
    memory:
      heapHeadroomPerNode: "512MB"
  exchangeManager:
    name: "filesystem"
    baseDir: "/tmp/trino-local-file-system-exchange-manager"
  workerExtraConfig: ""
  coordinatorExtraConfig: ""
  autoscaling:
    enabled: false
    maxReplicas: 5
    targetCPUUtilizationPercentage: 50
accessControl: {}
  # type: configmap
  # refreshPeriod: 60s
  # # Rules file is mounted to /etc/trino/access-control
  # configFile: "rules.json"
  # rules:
  #   rules.json: |-
  #     {
  #       "catalogs": [
  #         {
  #           "user": "admin",
  #           "catalog": "(mysql|system)",
  #           "allow": "all"
  #         },
  #         {
  #           "group": "finance|human_resources",
  #           "catalog": "postgres",
  #           "allow": true
  #         },
  #         {
  #           "catalog": "hive",
  #           "allow": "all"
  #         },
  #         {
  #           "user": "alice",
  #           "catalog": "postgresql",
  #           "allow": "read-only"
  #         },
  #         {
  #           "catalog": "system",
  #           "allow": "none"
  #         }
  #       ],
  #       "schemas": [
  #         {
  #           "user": "admin",
  #           "schema": ".*",
  #           "owner": true
  #         },
  #         {
  #           "user": "guest",
  #           "owner": false
  #         },
  #         {
  #           "catalog": "default",
  #           "schema": "default",
  #           "owner": true
  #         }
  #       ]
  #     }
additionalNodeProperties: {}
additionalConfigProperties: {}
additionalLogProperties: {}
additionalExchangeManagerProperties: {}
eventListenerProperties: {}
#additionalCatalogs: {}
additionalCatalogs:
  mysql: |-
    connector.name=mysql
    connection-url=jdbc:mysql://mysql-primary.mysql:3306
    connection-user=root
    connection-password=WyfORdvwVm
  hive: |-
    connector.name=hive
    hive.metastore.uri=thrift://hadoop-hadoop-hive-metastore.hadoop:9083
    hive.allow-drop-table=true
    hive.allow-rename-table=true
    #hive.config.resources=/tmp/core-site.xml,/tmp/hdfs-site.xml
# Array of EnvVar (https://v1-18.docs.kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#envvar-v1-core)
env: []
initContainers: {}
  # coordinator:
  #   - name: init-coordinator
  #     image: busybox:1.28
  #     imagePullPolicy: IfNotPresent
  #     command: ['sh', '-c', "until nslookup myservice.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
  # worker:
  #   - name: init-worker
  #     image: busybox:1.28
  #     command: ['sh', '-c', 'echo The worker is running! && sleep 3600']
securityContext:
  runAsUser: 10000
  runAsGroup: 10000
service:
  #type: ClusterIP
  type: NodePort
  port: 8080
  nodePort: 31880
nodeSelector: {}
tolerations: []
affinity: {}
auth: {}
  # Set username and password
  # https://trino.io/docs/current/security/password-file.html#file-format
  # passwordAuth: "username:encrypted-password-with-htpasswd"
serviceAccount:
  # Specifies whether a service account should be created
  create: false
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name: ""
  # Annotations to add to the service account
  annotations: {}
secretMounts: []
coordinator:
  jvm:
    maxHeapSize: "2G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  additionalJVMConfig: {}
  resources: {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi
worker:
  jvm:
    maxHeapSize: "2G"
    gcMethod:
      type: "UseG1GC"
      g1:
        heapRegionSize: "32M"
  additionalJVMConfig: {}
  resources: {}
    # We usually recommend not to specify default resources and to leave this as a conscious
    # choice for the user. This also increases chances charts run on environments with little
    # resources, such as Minikube. If you do want to specify resources, uncomment the following
    # lines, adjust them as necessary, and remove the curly braces after 'resources:'.
    # limits:
    #   cpu: 100m
    #   memory: 128Mi
    # requests:
    #   cpu: 100m
    #   memory: 128Mi

3) Trino catalog ConfigMap YAML

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "trino.catalog" . }}
  labels:
    app: {{ template "trino.name" . }}
    chart: {{ template "trino.chart" . }}
    release: {{ .Release.Name }}
    heritage: {{ .Release.Service }}
    role: catalogs
data:
  tpch.properties: |
    connector.name=tpch
    tpch.splits-per-node=4
  tpcds.properties: |
    connector.name=tpcds
    tpcds.splits-per-node=4
{{- range $catalogName, $catalogProperties := .Values.additionalCatalogs }}
  {{ $catalogName }}.properties: |
    {{- $catalogProperties | nindent 4 }}
{{- end }}
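
To see what the range loop produces for one entry of additionalCatalogs, the rendering can be imitated in shell. This is only an imitation of the template output for illustration; render_catalog is a hypothetical helper, and the real rendering is done by Helm.

```shell
#!/usr/bin/env sh
# render_catalog NAME PROPERTIES
# Prints a ConfigMap data entry: the key at 2 spaces, each property line at 4.
render_catalog() {
    printf '  %s.properties: |\n' "$1"
    printf '%s\n' "$2" | sed 's/^/    /'
}

# Example:
# render_catalog mysql "connector.name=mysql"
# To inspect what Helm actually renders, run this from the chart directory:
# helm template trino ./ -n trino
```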

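Only the core deployment configuration is listed here; the git download address is given further down. If you have any questions, feel free to leave a comment or send me a private message.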

4) Installing

cd trino-on-kubernetes
# Install
helm install trino ./ -n trino --create-namespace
# Upgrade
# helm upgrade trino ./ -n trino
# Uninstall
# helm uninstall trino -n trino
# Check
kubectl get pods,svc -n trino
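
Instead of checking the pod list by eye, a script can test it. The sketch below reads the output of kubectl get pods and succeeds only when every pod reports Running. all_pods_running is a hypothetical helper, and Running is a weaker signal than Ready, so treat it as a rough check.

```shell
#!/usr/bin/env sh
# all_pods_running: reads `kubectl get pods` output on stdin and succeeds only
# if there is at least one pod and every pod's STATUS column is Running.
all_pods_running() {
    awk 'NR > 1 { n++; if ($3 != "Running") bad = 1 }
         END { exit ((bad || n == 0) ? 1 : 0) }'
}

# Example:
# kubectl get pods -n trino | all_pods_running && echo "trino pods are running"
```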

5) Testing and verification

coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`
# Log in
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop
-- The statements below are entered in the Trino CLI session
-- List the catalogs
show catalogs;
select * from system.runtime.nodes;
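
The grep in the commands above returns every pod whose line contains "coordinator", which breaks kubectl exec if more than one matches. A stricter lookup might look like this sketch. first_coordinator is a hypothetical helper, and the pod names in any sample input are made up.

```shell
#!/usr/bin/env sh
# first_coordinator: reads `kubectl get pods` output on stdin and prints the
# name of the first pod whose name contains "coordinator"; fails if none does.
first_coordinator() {
    awk '$1 ~ /coordinator/ { print $1; found = 1; exit }
         END { exit (found ? 0 : 1) }'
}

# Example, running one statement non-interactively with the CLI's --execute option:
# coordinator_name=$(kubectl get pods -n trino | first_coordinator) || exit 1
# kubectl exec "$coordinator_name" -n trino -- /opt/apache/trino/bin/trino-cli \
#   --server http://trino-coordinator:8080 --user=hadoop --execute 'show catalogs'
```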

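IV. Configuring the Hive Data Source on k8s

For Hive on k8s, see this article of mine: Hadoop on k8s Quick Deployment, Advanced Condensed Edition

Add the data source in trino-on-kubernetes/values.yaml (under additionalCatalogs).

Then apply the updated configuration and restart the Trino nodes: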

helm upgrade trino ./ -n trino
# Restart: ConfigMap changes are not picked up dynamically, so the pods must be restarted
kubectl delete pod -n trino `kubectl get pods -n trino|awk 'NR!=1{print $1}'`
coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`
# Log in (the path is written out because ${TRINO_HOME} is not set in the local shell)
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop
-- The statements below are entered in the Trino CLI session
-- List the catalogs
show catalogs;
-- List the schemas in the hive catalog
show schemas from hive;
-- List the tables
show tables from hive.default;
create schema hive.test;
-- Create a table
CREATE TABLE hive.test.movies (
  movie_id bigint,
  title varchar,
  rating real, -- real is similar to the float type
  genres varchar,
  release_year int
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['release_year'] -- note: partition columns must come last in the column list above
);
-- Load data into the Hive table
INSERT INTO hive.test.movies
VALUES 
(1, 'Toy Story', 8.3, 'Animation|Adventure|Comedy', 1995), 
(2, 'Jumanji', 6.9, 'Action|Adventure|Family', 1995), 
(3, 'Grumpier Old Men', 6.5, 'Comedy|Romance', 1995);
-- Query the data
select * from hive.test.movies;

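V. Quick Deployment: Core Steps (jump here if you only care about deploying)

If you only want a quick deployment, you can skip everything above and just run the following steps:

1) Install git

# 1. Install git
yum -y install git

2) Download the Trino deployment package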

git clone git@github.com:HBigdata/trino-on-kubernetes.git
cd trino-on-kubernetes

3) Configure the data sources

cat -n values.yaml
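
Each block under additionalCatalogs in values.yaml becomes a catalog properties file, and Trino expects every catalog file to set connector.name. A quick check of a catalog block before upgrading might look like this sketch; has_connector_name is a hypothetical helper.

```shell
#!/usr/bin/env sh
# has_connector_name: reads catalog properties on stdin and succeeds only if
# a non-empty connector.name is set.
has_connector_name() {
    grep -q '^[[:space:]]*connector\.name=[^[:space:]]'
}

# Example:
# printf 'connector.name=hive\nhive.metastore.uri=thrift://hadoop-hadoop-hive-metastore.hadoop:9083\n' | has_connector_name
```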

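4) Configure resource requests and limits

Set coordinator.resources and worker.resources in values.yaml (both default to {}).

5) Adjust the Trino configuration

JVM memory configuration: set coordinator.jvm.maxHeapSize and worker.jvm.maxHeapSize in values.yaml.

6) Start the deployment

# git clone git@github.com:HBigdata/trino-on-kubernetes.git
# cd trino-on-kubernetes
# Install
helm install trino ./ -n trino --create-namespace
# Upgrade (run only after changing the configuration)
# helm upgrade trino ./ -n trino
# Uninstall
# helm uninstall trino -n trino

7) Testing and verification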

coordinator_name=`kubectl get pods -n trino|grep coordinator|awk '{print $1}'`
# Log in (the path is written out because ${TRINO_HOME} is not set in the local shell)
kubectl exec -it $coordinator_name -n trino -- /opt/apache/trino/bin/trino-cli --server http://trino-coordinator:8080 --catalog=hive --schema=default --user=hadoop
-- The statements below are entered in the Trino CLI session
-- List the catalogs
show catalogs;
-- List the schemas in the hive catalog
show schemas from hive;
-- List the tables
show tables from hive.default;
create schema hive.test;
-- Create a table
CREATE TABLE hive.test.movies (
  movie_id bigint,
  title varchar,
  rating real, -- real is similar to the float type
  genres varchar,
  release_year int
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['release_year'] -- note: partition columns must come last in the column list above
);
-- Load data into the Hive table
INSERT INTO hive.test.movies
VALUES 
(1, 'Toy Story', 8.3, 'Animation|Adventure|Comedy', 1995), 
(2, 'Jumanji', 6.9, 'Action|Adventure|Family', 1995), 
(3, 'Grumpier Old Men', 6.5, 'Comedy|Romance', 1995);
-- Query the data
select * from hive.test.movies;

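That completes the Trino on k8s deployment and the usability demo. If you have any questions, follow my WeChat official account 大数据与云原生技能共享 and join the group chat or send me a private message. If this article helped you, please like, share, and bookmark it.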
