Spark性能优化之foreach与foreachPartition

首先,我们对比一下foreachPartitionforeach两个方法的实现,有什么不同的地方:

1
2
3
4
5
6
7
8
9
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}

Scala异常获取一例

在处理第11行读文件时,由于数据文件出现的不规律,在指定日期内可能存在日志文件不存在的情况,这里需要处理下异常:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def readLog(sc: SparkContext, startDate: String, endDate: String, logNames: List[String]): RDD[String] = {
val dateLst = DateUtils.getDateListBetweenTwoDate(startDate, endDate)

var logRdd = sc.makeRDD(List[String]())
for (date <- dateLst) {
val year = date.substring(0, 4)
val month = date.substring(4, 6)
val day = date.substring(6, 8)
for (logName <- logNames) {
val logRdd = logRdd.union(
try {sc.textFile(s"cosn://fuge/mid-data/fuge/ssp/bid-log/$year/$month/$day/${logName}*")
.map(x => x.split("\\|", -1))
.filter(x => x.length >= 2 && (x(1).trim == "6" || x(1).trim == "0")).map(_.toString) // 0和6为请求成功的状态码
} catch {
case _: Exception => sc.makeRDD(List[String]())
}
)
}
}
logRdd
}

Gitlab问题小结

supervise_redis_sleep 长时间卡死

解决方案:

1、按住CTRL+C强制结束;

2、运行:sudo systemctl restart gitlab-runsvdir;

3、再次执行:sudo gitlab-ctl reconfigure

无GUI的CentOS上使用Selenium+Chrome

客户的网站上的监测代码最近连续两次在网站更新时被清除掉,导致无法正常获取网站访问数据,影响到后续大数据分析。

为解决这个问题,决定使用Python Selenium模块来实现网站按钮模拟点击,同时监测我们后台是否能正常收到,以此来判断网站按钮监测代码是否有正常部署。

Selenium很好用很强大,开发和部署也都很简单,是自动化测试非常好的工具,但是问题是我们需要在无GUI的服务器上进行部署,这就牵涉到在无GUI的服务器上安装浏览器的问题,我这里选择的是Chrome。


下面简单分享一个部署过程中遇到的坑,也当作是总结。

V

V’s speech is recognized by the analysts at Smith Change the World Incorporated as one of the most influential speeches of the near future.

Nginx开启HTTPS反向代理访问Jira失败

使用的Jira 版本:v7.1.1,之前一直是http访问,在买完证书,部署好https访问时,界面一直在提醒:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
We've detected a potential problem with JIRA's Dashboard configuration that your administrator can correct. Click here to learn more

We've detected a potential problem with JIRA's Dashboard configuration that your administrator can correct. Hide
Dashboard Diagnostics: Mismatched URL Scheme

JIRA is reporting that it is using the URL scheme 'http', which does not match the scheme used to run these diagnostics, 'https'. This is known to cause JIRA to construct URLs using an incorrect hostname, which will result in errors in the dashboard, among other issues.

The most common cause of this is the use of a reverse-proxy HTTP(S) server (often Apache or IIS) in front of the application server running JIRA. While this configuration is supported, some additional setup might be necessary in order to ensure that JIRA detects the correct scheme.

The following articles describe the issue and the steps you should take to ensure that your web server and app server are configured correctly:

Gadgets do not display correctly after upgrade to JIRA 4.0
Integrating JIRA with Apache
Integrating JIRA with Apache using SSL

If you believe this diagnosis is in error, or you have any other questions, please contact Atlassian Support.

Detailed Error

com.atlassian.gadgets.dashboard.internal.diagnostics.UrlSchemeMismatchException: Detected URL scheme, 'http', does not match expected scheme 'https'

FreeIPA Client 端部署

新增IPA客户端主机

修改DNS服务器

vim /etc/resolv.conf
把下面两行放最上面

1
2
search bd.example.com.cn
nameserver 192.168.2.150

修改hosts文件中第二行旧的主机名(很重要)