Jenkins Master takes a lot of time to reboot - jenkins

We have a Jenkins Master with about 27K jobs at any point in time. It does not run any builds itself; it uses slaves for that. The machine has 12 cores, 64 GB RAM, and SSD storage.
Whenever we reboot the master, it takes 15-30 minutes to start. This puts pressure on our maintenance window and hurts our uptime SLAs.
What would you suggest we do to reduce this reboot time?

Related

“Cannot allocate memory” when starting new Flink job

We are running Flink on a 3-VM cluster. Each VM has about 40 GB of RAM. Each day we stop some jobs and start new ones. After some days, starting a new job is rejected with a "Cannot allocate memory" error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000340000000, 12884901888, 0) failed; error='Cannot allocate memory' (errno=12)
Investigation shows that the task manager's RAM keeps growing, to the point where it exceeds the allowed 40 GB, even though the jobs are cancelled.
I don't have access (yet) to the cluster so I tried some tests on a standalone cluster on my laptop and monitored the task manager RAM:
With jvisualvm I can see everything working as intended. I load memory with the job, then clean it up and wait (a few minutes) for the GC to run. The heap is released.
With top, however, the memory stays high.
At the moment we are restarting the cluster every morning to work around this memory issue, but we can't afford that anymore as we will need jobs running 24/7.
I'm pretty sure it's not a Flink issue but can someone point me in the right direction about what we're doing wrong here?
In standalone mode, Flink may not release resources the way you expect.
For example, resources held by static members are not freed when a job is cancelled.
It is highly recommended to use YARN or K8s as the runtime environment.
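As a rough illustration of the static-member point (a made-up sketch, not your code and not a Flink API): anything reachable from a static member of a class loaded in the standalone TaskManager stays alive for as long as that JVM runs, so cancelling the job does not give the memory back:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical user class loaded inside a long-running standalone TaskManager.
    public class LookupCache {
        // Filled while a job runs, never cleared when the job is cancelled,
        // so the GC cannot reclaim it and the TaskManager footprint keeps growing.
        private static final List<byte[]> CACHE = new ArrayList<>();

        public static void add(byte[] chunk) {
            CACHE.add(chunk);
        }
    }

With per-job deployments (e.g. on YARN), the TaskManager processes are torn down together with the job, so this kind of accumulation does not outlive the job.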

How to scale up with Akka

I tested 10 routees of an Akka Actor on one machine, and they completed the task I wanted (read from a DB and make an API call) in 3 minutes. Then I changed my configuration to create 50 routees, and it completed the task in the same amount of time.
I also increased the CPU and RAM of my VM, but the time it took did not go down. My plan is that by increasing the CPU cores and RAM, I can complete the task in a SHORTER time.
How do I achieve that?
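For reference, this is roughly how I create the routees (a simplified sketch; DbApiWorker is a placeholder for my real actor that reads from the DB and makes the API call):

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.routing.RoundRobinPool;

    public class RouterSetup {

        // Placeholder for the real worker actor.
        static class DbApiWorker extends AbstractActor {
            @Override
            public Receive createReceive() {
                return receiveBuilder()
                    .match(String.class, id -> {
                        // read the record from the DB and make the API call here
                    })
                    .build();
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("scaling-test");
            // The pool size is the number I changed from 10 to 50.
            ActorRef router = system.actorOf(
                new RoundRobinPool(50).props(Props.create(DbApiWorker.class)),
                "worker-router");
            router.tell("some-record-id", ActorRef.noSender());
        }
    }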

“Size in Memory” under storage tab of spark UI showing increase in RAM usage over time for spark streaming

I am using Spark Streaming in my application. Data arrives as streaming files every 15 minutes. I have allocated 10G of RAM to the Spark executors. With this setting my Spark application is running fine.
But looking at the Spark UI, under the Storage tab -> "Size in Memory", the RAM usage keeps increasing over time.
When I started the streaming job, the "Size in Memory" usage was in KB. It has now been 2 weeks, 2 days and 22 hours since I started the streaming job, and the usage has increased to 858.4 MB.
I have also noticed one more thing, under the Streaming heading:
When I started the job, Processing Time and Total Delay (from the image) were 5 seconds, and after 16 days they have increased to 19-23 seconds, while the streaming file size is almost the same.
Before increasing the executor memory to 10G, the Spark job kept failing almost every 5 days (with the default executor memory, which is 1GB). Since increasing the executor memory to 10G, it has been running continuously for more than 16 days.
I am worried about the memory issue. If the "Size in Memory" value keeps increasing like this, then sooner or later I will run out of RAM and the Spark job will fail again, even with 10G of executor memory. What can I do to avoid this? Do I need to change some configuration?
Just to give some context on my Spark application, I have enabled the following properties in the Spark context:
SparkConf sparkConf = new SparkConf().setMaster(sparkMaster)
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    .set("spark.streaming.minRememberDuration", "1440");
And I have also enabled checkpointing as follows:
sc.checkpoint(hadoop_directory)
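For completeness, this is roughly how the above is wired together (the master URL, paths, and batch interval below are placeholders, not my exact values):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingSetup {
        public static void main(String[] args) throws Exception {
            String sparkMaster = "spark://master-host:7077"; // placeholder
            SparkConf sparkConf = new SparkConf().setMaster(sparkMaster)
                .set("spark.streaming.receiver.writeAheadLog.enable", "true")
                .set("spark.streaming.minRememberDuration", "1440");

            // Files arrive every 15 minutes, so the batch interval matches that.
            JavaStreamingContext sc =
                new JavaStreamingContext(sparkConf, Durations.minutes(15));
            sc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints"); // placeholder

            // Placeholder input directory where the 15-minute files land;
            // the real job does more than print, of course.
            sc.textFileStream("hdfs://namenode:8020/user/spark/incoming").print();

            sc.start();
            sc.awaitTermination();
        }
    }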
I want to highlight one more thing: I had an issue while enabling checkpointing. Regarding that checkpointing issue, I have already posted a question at the following link:
Spark checkpoining error when joining static dataset with DStream
I was not able to set up checkpointing the way I wanted, so I did it differently (shown above), and it is working fine now. I am not asking the checkpointing question again; I only mention it because it might help you understand whether the current memory issue is somehow related to the previous one (checkpointing).
Environment details:
Spark 1.4.1 on a two-node cluster of CentOS 7 machines, with Hadoop 2.7.1.
I am worried about the memory issue. If the "Size in Memory" value keeps increasing like this, then sooner or later I will run out of RAM and the Spark job will fail again, even with 10G of executor memory.
No, that's not how RAM works. Running out is perfectly normal, and when you run out, the system takes RAM that is being used for less important purposes and uses it for more important ones.
For example, if your system has free RAM, it can try to keep a copy of everything it wrote to disk in RAM as well. Who knows, somebody might try to read it from disk again, and having it in RAM will save an I/O operation. Since free RAM is wasted forever (it's not as if you can use 1GB less today in order to use 1GB more tomorrow; any RAM not used right now is a chance to avoid I/O that is lost forever), the system might as well use it for anything that might help. But that doesn't mean it can't evict those things from RAM when it needs the RAM for some other purpose.
It is not at all unusual for almost all of a system's RAM to be in use and almost all of it to also be available at the same time. This is typical behavior on most modern systems.

Out of memory on all Solr slaves of the same shard

We are using Apache Solr 3.5 to drive our website catalog search. We use the field collapsing feature with multiple shards, each shard backed by a cluster of read-only slaves.
Recently we ran into Out of Memory errors on all slaves of a particular shard. We use field collapsing on a particular field which has only one specific value across all documents of the shard whose slaves went out of memory. Interestingly, the Out of Memory error recurred multiple times during the day (about 4 times in 24 hours) without any significant deviation from normal traffic. The max heap size allocated to each slave is 8 GB on a 16 GB machine.
Since then we have done the following, and the problem seems to be contained for now:
Added more slaves to the problematic slave group: from 3 we have brought it up to 6.
We have increased the replication poll interval from 5 minutes to 20 minutes. We found that the background process SolrSearchIndexer.warm consumes the largest amount of heap space (about 6 GB), precisely when queries start going out of memory. Since each replication cycle triggers the warming of searchers, we reduced how often it happens.
We have decreased the minimum heap allocation for Tomcat on all slaves of this group to 1 GB. Earlier this was 4 GB.
One of the 3 problem slaves was throwing write.lock exceptions on an unused core. We have since removed the unused core from all slaves, since it was replicating from another Solr master. The unused core had about 1.5 million docs and consumed about 605 MB on disk.
We dropped the entire index on all the slaves and replicated everything from scratch. Incidentally, one of the slaves had an unusually large index on disk: 2.2 GB compared to 1 GB on the other slaves.
The typical size of the index directory on the problem shard is around 1 GB, with about 1 million documents in all. Each slave serves about 10 requests/second on average.
We have tried replaying the entire day's logs in a test environment, but somehow the test Solr never goes out of memory with the same heap settings. Frankly, we are not certain that this will not happen again.
Can someone suggest what the problem could be here? Any help would be greatly appreciated.
Thanks,
Tushar
I suspect that it comes down to the cache definitions. How many searchers have you allowed to reside in parallel (defaults to 2, but you can change it)? Searcher warmup is actually cache warmup, so if you have a working searcher and a warming searcher, they occupy twice the memory. Which caches do you use (document/query/filter/field/custom)? Are you using facets extensively (they use the field cache internally)? Many different filter queries (fq) (again, cached bitmaps)?
I think that field collapsing also uses the field cache.
It has been quite some time since this happened, but I think it is worthwhile to share the reason here. Our website was being scraped by someone who was using very large start params in the queries. The Solr distributed index has a limitation on the size of the start parameter (in excess of 500000). The Out of Memory errors used to occur when a heavy replication happened while the coordinating shard already had a lot of documents in memory, coming in from the contributing nodes because of a high start parameter.
Details can be found here - https://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations
Our solution was to cap the start parameter at about 1000, as humans rarely go beyond the first few pages of a listing.
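On the query side, the cap looks roughly like this (a simplified SolrJ sketch, not our exact code; MAX_START is the ~1000 limit mentioned above):

    import org.apache.solr.client.solrj.SolrQuery;

    public class CatalogSearch {
        // Humans rarely page past the first few result pages,
        // so larger offsets are clamped before the query reaches the shards.
        private static final int MAX_START = 1000;

        public static SolrQuery buildQuery(String q, int requestedStart, int rows) {
            SolrQuery query = new SolrQuery(q);
            query.setStart(Math.min(requestedStart, MAX_START)); // cap the start offset
            query.setRows(rows);
            return query;
        }
    }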

Jenkins/Hudson Master server? [closed]

Currently I have a Jenkins/Hudson master in a VM with 8 GB RAM and 250 GB of disk space, connected to several slaves which are physical boxes. Is this okay? Moreover, no build jobs run on the master itself. Or should the master be a physical box? Would I gain any performance improvement? Please suggest.
Thanks
If the master is not building components itself (# of executors in the Node Configuration), it does not do anything other than provide the web interface to Jenkins. I don't think there is a big difference compared to a physical webserver.
Usually, using a VM as a build system is slower than using physical hardware directly. But this depends on the software you're using to run the VM.
