How to split a large dataset into train/test sets but also use pandas batch-size iterations for updating - loops

I am updating my parameters every iteration with a batch from a very large file. But before I do this I want to split the entire large dataset into a train and a test set, and I want to do the same for cross-validation.
I have tried using dask to split the entire set and then convert a partition to pandas so I can use batches for updating my algorithm.
The dask part (which I would rather not use, if possible):
import dask.dataframe as dd
df_bag = dd.read_csv("gdrive/My Drive/train_triplets.txt",
                     blocksize=int(1e9), sep=r'\s+', header=None)
df_train, df_test = df_bag.random_split([2/3, 1/3], random_state=0)
df_batch = df_train.loc[1:1000].compute()
The pandas part:
import pandas as pd
df_chunk = pd.read_csv("gdrive/My Drive/train_triplets.txt",
                       chunksize=6000000, sep=r'\s+', header=None)
for chunk in df_chunk:
    pass  # here I have my algorithm
I expect that it is possible to use a pandas function to create a chunked reader from a URL, as I already have, but then split it into a train and a test set, so that I can iterate in batches over the large train and test sets individually, and so that I can also split the train set to perform cross-validation.
Edit: My DataFrame is a TextFileReader; how do I get a train and a test set from this, and/or can I do cross-validation?
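For illustration, a minimal sketch of one way to do this with pandas alone (an assumption about the intended workflow, not a verified solution): because pd.read_csv with a chunksize yields the chunks in the same order and with the same sizes on every pass over the file, seeding the random generator per chunk reproduces the same 2/3-1/3 split each time, so train and test rows never mix between passes.
import numpy as np
import pandas as pd

reader = pd.read_csv("gdrive/My Drive/train_triplets.txt",
                     chunksize=6000000, sep=r'\s+', header=None)

for i, chunk in enumerate(reader):
    rng = np.random.default_rng(seed=i)       # same seed per chunk on every pass
    mask = rng.random(len(chunk)) < 2 / 3     # True -> train, False -> test
    train_batch, test_batch = chunk[mask], chunk[~mask]
    # ... update the algorithm with train_batch, hold out / accumulate test_batch ...
    # For cross-validation, draw fold ids instead of a boolean mask, e.g.
    # folds = rng.integers(0, 5, len(chunk)) and train on rows with folds != k.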

Related

Efficient way to read a very wide dataset in Scala or Java [duplicate]

I am receiving a fixed-width .txt source file from which I need to extract 20K columns.
Due to the lack of libraries for processing fixed-width files in Spark, I have developed code which extracts the fields from fixed-width text files.
The code reads the text file as an RDD with
sparkContext.textFile("abc.txt")
then reads a JSON schema and gets the column names and the width of each column.
In the function I read the fixed-length string and, using the start and end positions, apply the substring function to build the array.
I map the function over the RDD.
Then I convert the above RDD to a DataFrame, map the column names, and write to Parquet.
The representative code:
val rdd1 = spark.sparkContext.textFile("file1")

def SubstrString(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0
  val collector = new Array[String](colLength.length)
  for (k <- 0 to colLength.length - 1) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// ColLengthSeq is read from another schema file and holds the column lengths
val StringArray = rdd1.map(SubstrString(_, ColLengthSeq))

StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j) as column_seq(j)): _*)
  .write.mode("overwrite").parquet("c:\\home\\")
This code works fine with files with a small number of columns; however, it takes a lot of time and resources with 20K columns.
As the number of columns increases, the run time also increases.
Has anyone faced this kind of issue with a large number of columns?
I need suggestions on performance tuning: how can I tune this job or code?

How to load a large csv file, validate each row and process the data

I'm looking to validate each row of a CSV file of more than 600 million rows and up to 30 columns (the solution must handle several large CSV files of that size).
Columns can be text, dates or amounts. The CSV must be validated with 40 rules: some rules check the correctness of the amounts, some check the dates (the format), etc.
The result of each validation rule must be saved and will be displayed afterwards.
Once the data is validated, a second stage of validation rules will be applied, based this time on sums, averages, etc.; the results of each of these rules must also be saved.
I'm using Spark to load the file, with
session.read().format("com.databricks.spark.csv").option("delimiter", "|").option("header", "false").csv(csvPath)
or
session.read().option("header", "true").text(csvPath);
To iterate over each line I see that there are two options:
Use dataset.map(row -> { something });
where "something" validates each row and saves the result somewhere.
But as the "something" block is executed on the executors, I don't see how to get the results back to the driver or store them somewhere they can be retrieved by the driver process.
The second option is to use dataset.collect, but it will cause an out-of-memory error, as all the data will be loaded onto the driver. We could use the "take" method, then remove that subset from the dataset (with a filter) and repeat the operation, but I'm not comfortable with this approach.
I was wondering if someone could suggest a robust method to deal with this kind of problem: basically, keep Spark for the second stage of validation rules, and use Spark or another framework to ingest the file and execute the first set of validation rules.
Thanks in advance for your help
You can use the SparkSession to read the CSV file, then partition the data by a column and process it in batches. For example, suppose you are writing the data to an external DB, which does not need much processing:
dataFrame
.write
.mode(saveMode)
.option("batchsize", 100)
.jdbc(url, "tablename", new java.util.Properties())
If your business logic demands that you process each and every row of a Dataset/DataFrame, you can use df.map(). If your logic can work on multiple rows at once, you can go with df.mapPartitions(). Tasks with high per-record overhead perform better with a mapPartitions than with a map transformation.
Consider the case of initializing a database connection. If we use map() or foreach(), the number of times we need to initialize it equals the number of elements in the RDD, whereas with mapPartitions() it equals the number of partitions.
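As an illustration of the mapPartitions point, here is a hedged PySpark sketch (the question uses the Java API, but the pattern is the same); get_connection() and validate_row() are hypothetical placeholders, not real library calls:
# One expensive resource (e.g. a DB connection) per partition instead of one per row.
# get_connection() and validate_row() stand in for your own code.
def validate_partition(rows):
    conn = get_connection()                # initialized once per partition
    for row in rows:
        yield validate_row(conn, row)      # e.g. returns (id, rule1_ok, rule2_ok)
    conn.close()                           # runs once the partition is exhausted

results_df = (df.rdd
                .mapPartitions(validate_partition)
                .toDF(["id", "rule1_result", "rule2_result"]))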
You can simply append the columns with check results to your original dataframe and use a bunch of rule UDFs to perform the actual validation, something like this:
object Rules {
  val rule1UDF = udf(
    (col1: String, col2: String) => {
      // your validation code goes here
      true // the result of validation
    }
  )
}
// ...
val nonAggregatedChecksDf = df
  .withColumn("rule1_result", Rules.rule1UDF(col("col1"), col("col2")))
  .withColumn("rule2_result", Rules.rule2UDF(col("col1"), col("col3")))
  .select("id", "rule1_result", "rule2_result", <all the columns relevant for the aggregation checks>)

val aggregatedChecksDf = nonAggregatedChecksDf
  .agg(<...>)
  .withColumn("rule3_result", Rules.rule3UDF(col("sum1"), col("avg2")))
  .withColumn("rule4_result", Rules.rule4UDF(col("count1"), col("count3")))
  .select("id", "rule1_result", "rule2_result", "rule3_result", "rule4_result")
The second option is to use dataset.collect
I'd advise against that; rather, select a key field from your original dataframe plus all the check-result columns and save them in a columnar format such as Parquet.
aggregatedChecksDf
.select("id", "rule1_result", "rule2_result", "rule3_result", "rule4_result")
.write
.mode(saveMode)
.parquet(path)
This will be much faster, as the writes are done by all executors in parallel and the driver doesn't become a bottleneck. It will also most likely help avoid OOM issues, since the memory usage is spread across all the executors.

Creating subsets of data using bash and measuring comparisons in java programs

I am stumped on this question I have to do. It's part of some extra exercises I am doing for my Computer Science course.
We have to conduct an experiment to count the number of comparisons that a binary search tree performs versus a traditional array. I have written both programs in Java and read data from a large dataset (that contains dam information like names, levels, locations, etc.), extracting the necessary information and storing the data in objects in the array/binary tree.
I am stuck on this question:
Conduct an experiment with DamArrayApp (array) and DamBSTApp (binary search tree) to demonstrate the speed difference for searching between a BST and a traditional array.
You want to vary the size of the dataset (n) and measure the number of comparison operations in the best/average/worst case for every value of n (from 1 to 211). For each value of n:
Create a subset of the sample data (hint: use the Unix head command).
Run both instrumented applications for every dam name in the subset of the data file. Store all operation count values.
Determine the minimum (best case), maximum (worst case) and average of these count values.
It is recommended that you use Unix or Python scripts to automate this process.
I do not know how to even start this. I know I am meant to use Bash, but I have not been taught it. The dataset is a CSV file that has 211 rows of data, so I need to make it shorter and count the operations for each case. Any help would be greatly appreciated; even help with just the Bash script would suffice.
I need the bash script to turn the data file from this:
row1
row2
.
.
.
row211
To something like:
row1
row2
And then another subset like:
row1
row2
row3
row4
(Basically a subset for every n up to 211)
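Since the exercise allows Unix or Python scripts, here is a minimal Python sketch of the whole loop. It assumes the data file is named dams.csv, that the dam name is the first CSV field, and that each instrumented app can be run as java DamArrayApp <subset_file> <dam_name> and prints a single comparison count; all of those details are assumptions to adjust to your actual programs.
import subprocess

with open("dams.csv") as f:
    rows = f.read().splitlines()

for n in range(1, len(rows) + 1):              # n = 1 .. 211
    subset_file = f"subset_{n}.csv"
    with open(subset_file, "w") as out:        # same effect as: head -n $n dams.csv
        out.write("\n".join(rows[:n]) + "\n")

    for app in ("DamArrayApp", "DamBSTApp"):
        counts = []
        for row in rows[:n]:
            dam_name = row.split(",")[0]       # assumed: name is the first column
            result = subprocess.run(["java", app, subset_file, dam_name],
                                    capture_output=True, text=True)
            counts.append(int(result.stdout.strip()))
        best, worst = min(counts), max(counts)
        average = sum(counts) / len(counts)
        print(n, app, best, average, worst)    # best / average / worst case for this n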

LSH Spark gets stuck forever at the approxSimilarityJoin() function

I am trying to implement LSH in Spark to find the nearest neighbours for each user on a very large dataset containing 50000 rows and ~5000 features per row. Here is the code related to this.
MinHashLSH mh = new MinHashLSH().setNumHashTables(3).setInputCol("features")
.setOutputCol("hashes");
MinHashLSHModel model = mh.fit(dataset);
Dataset<Row> approxSimilarityJoin = model.approxSimilarityJoin(dataset, dataset, config.getJaccardLimit(), "JaccardDistance");
approxSimilarityJoin.show();
The job gets stuck at approxSimilarityJoin() function and never goes beyond it. Please let me know how to solve it.
It will finish if you leave it long enough; however, there are some things you can do to speed it up. Reviewing the source code, you can see that the algorithm:
hashes the inputs
joins the 2 datasets on the hashes
computes the jaccard distance using a udf and
filters the dataset with your threshold.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala
The join is probably the slow part here, as the data is shuffled. So here are some things to try:
change your dataframe input partitioning
change spark.sql.shuffle.partitions (the default gives you 200 partitions after a join)
your dataset looks small enough that you could use spark.sql.functions.broadcast(dataset) for a map-side join
Are these vectors sparse or dense? The algorithm works better with SparseVectors.
Of these four options, 2 and 3 have worked best for me, while always using SparseVectors.
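For illustration, a hedged PySpark sketch of options 2 and 3 (the question's code is Java, but the corresponding calls exist there too); the input path, the "features" column, and the 0.6 threshold are placeholders, and whether the broadcast hint actually pays off depends on your data size and Spark version:
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from pyspark.ml.feature import MinHashLSH

spark = SparkSession.builder.getOrCreate()

# Option 2: tune the number of shuffle partitions (the default is 200 after a join)
spark.conf.set("spark.sql.shuffle.partitions", "50")

dataset = spark.read.parquet("features.parquet")   # assumed input with sparse "features"

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3)
model = mh.fit(dataset)

# Option 3: hint a broadcast of one side of the self-join (map-side join)
joined = model.approxSimilarityJoin(dataset, broadcast(dataset), 0.6,
                                    distCol="JaccardDistance")
joined.show()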

Apache Spark - Split Dataset Rows at Specific Line - Java/Pyspark

I'm working through this initially in a Pyspark shell with plans to later write Java code to accomplish the same things.
I'm reading a text file into a DataFrame.
df = spark.read.text("log.txt")
The text file is a log file broken into two sections, delineated by a line containing a specific string. I would like to split the initial DataFrame into two separate DataFrames, one for each section. I've struggled to find info on this kind of DataFrame manipulation (operating on rows instead of columns). My best guess is that the logic would look something like: find the row number of the line with the string, then create new DataFrames by comparing each row's number against that row number. I don't know whether that is actually efficient, though.
Is this kind of thing doable with a DataFrame, and if so, how would I go about doing it? Or would it be more efficient to just read the initial file line by line and create the two DataFrames there?
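For what it's worth, a minimal PySpark sketch of the row-number idea described above; "=== SECTION 2 ===" stands in for the actual delimiter string, and the marker line is assumed to appear exactly once:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("log.txt")

# Attach a consecutive line index; zipWithIndex preserves the order in which
# the lines were read from the single input file.
indexed = (df.rdd.zipWithIndex()
             .map(lambda pair: (pair[0].value, pair[1]))
             .toDF(["value", "idx"]))

# Row number of the delimiter line ("=== SECTION 2 ===" is a placeholder).
marker_idx = indexed.filter(col("value").contains("=== SECTION 2 ===")).first()["idx"]

section1 = indexed.filter(col("idx") < marker_idx).select("value")
section2 = indexed.filter(col("idx") > marker_idx).select("value")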
