Parafac-spark: parallel tensor decompositions on spark

buir.advisorAykanat, Cevdet
dc.contributor.authorBekçe, Selim Eren
dc.date.accessioned2019-08-19T06:46:54Z
dc.date.available2019-08-19T06:46:54Z
dc.date.copyright2019-08
dc.date.issued2019-08
dc.date.submitted2019-08-09
dc.departmentDepartment of Computer Engineeringen_US
dc.descriptionCataloged from PDF version of article.en_US
dc.descriptionThesis (M.S.): İhsan Doğramacı Bilkent University, Department of Computer Engineering, 2019.en_US
dc.descriptionIncludes bibliographical references (leaves 61-62).en_US
dc.description.abstractTensors are higher-order generalizations of matrices, widely used in many data science applications and scientific disciplines. The Canonical Polyadic Decomposition (also known as CPD/PARAFAC) is a widely adopted tensor factorization for discovering and extracting latent features of tensors, usually computed via the alternating least squares (ALS) method. Developing efficient parallelizations of PARAFAC on commodity clusters is important because, as common tensor sizes reach billions of nonzeros, a naive implementation would require infeasibly large intermediate memory. Implementations of PARAFAC-ALS on shared- and distributed-memory systems are available, but they require expensive cluster setups, are low level, are not compatible with modern tooling, and are not fault tolerant by design. Many companies and data science communities prefer Apache Spark, a modern distributed computing framework with in-memory caching, and the Hadoop ecosystem of tools for their ease of use, compatibility, ability to run on commodity hardware, and fault tolerance. We developed PARAFAC-SPARK, an efficient, parallel, open-source implementation of PARAFAC on Spark, written in Scala. It can decompose 3D tensors stored in the common coordinate (COO) format in parallel with a low memory footprint by partitioning them as grids and utilizing the compressed sparse rows (CSR) format for efficient traversals. We followed and combined many of the algorithmic and methodological improvements of its predecessor implementations on Hadoop and distributed memory, and adapted them for Spark. During the MTTKRP kernel operation, by applying a multi-way dynamic partitioning scheme, we were also able to increase the number of reducers to be on par with the number of cores, achieving better utilization and a reduced memory footprint. We ran PARAFAC-SPARK on some real-world tensors and evaluated the effectiveness of each improvement as a series of variants compared with each other, as well as on some synthetically generated tensors with up to billions of rows to measure its scalability. Our fastest variant (PS-CSRSX) is up to 67% faster than our baseline Spark implementation (PS-COO) and up to 10 times faster than state-of-the-art Hadoop implementations.en_US
dc.description.degreeM.S.en_US
dc.description.statementofresponsibilityby Selim Eren Bekçeen_US
dc.embargo.release2020-02-08en_US
dc.format.extentxiii, 62 leaves : charts ; 30 cm.en_US
dc.identifier.itemidB14403
dc.identifier.urihttp://hdl.handle.net/11693/52346
dc.language.isoEnglishen_US
dc.publisherBilkent Universityen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectSparken_US
dc.subjectTensoren_US
dc.subjectFactorizationen_US
dc.subjectDecompositionen_US
dc.subjectParafacen_US
dc.subjectCPDen_US
dc.subjectALSen_US
dc.subjectAlternating least squaresen_US
dc.subjectPARAFAC-ALSen_US
dc.subjectCPD-ALSen_US
dc.subjectHadoopen_US
dc.subjectGriden_US
dc.subjectPartitioningen_US
dc.subjectData scienceen_US
dc.subjectBig dataen_US
dc.titleParafac-spark: parallel tensor decompositions on sparken_US
dc.title.alternativeParafac-spark: spark ile paralel tensör ayrışımlarıen_US
dc.typeThesisen_US
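
As context for the MTTKRP kernel and COO storage mentioned in the abstract, below is a minimal, hypothetical Scala/Spark sketch of a mode-1 MTTKRP over a COO-stored 3D tensor with broadcast dense factor matrices. It is an illustrative sketch only: the names (Coo, mttkrpMode1, MttkrpSketch) are invented here, and it does not reproduce the thesis's PARAFAC-SPARK code, its grid partitioning, or its CSR-based traversal.

// Hypothetical sketch: mode-1 MTTKRP on Spark over a COO-stored sparse 3D tensor.
// Not the thesis's PARAFAC-SPARK implementation; names and structure are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

// A single nonzero of a 3D tensor in coordinate (COO) format.
case class Coo(i: Int, j: Int, k: Int, v: Double)

object MttkrpSketch {
  // Computes M(i, :) = sum over nonzeros (i, j, k, v) of v * (B(j, :) * C(k, :)),
  // where * is the elementwise (Hadamard) product of the factor-matrix rows.
  def mttkrpMode1(nnz: RDD[Coo],
                  bB: Broadcast[Array[Array[Double]]],
                  bC: Broadcast[Array[Array[Double]]],
                  rank: Int): RDD[(Int, Array[Double])] = {
    nnz.map { e =>
        val row  = new Array[Double](rank)
        val bRow = bB.value(e.j)
        val cRow = bC.value(e.k)
        var r = 0
        while (r < rank) { row(r) = e.v * bRow(r) * cRow(r); r += 1 }
        (e.i, row)
      }
      // Sum partial rows per mode-1 index; the number of reduce tasks is a tuning knob.
      .reduceByKey((x, y) => x.zip(y).map { case (a, b) => a + b })
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mttkrp-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val rank = 2
    // Toy 2 x 2 x 2 tensor with three nonzeros.
    val nnz = sc.parallelize(Seq(Coo(0, 0, 0, 1.0), Coo(0, 1, 1, 2.0), Coo(1, 0, 1, 3.0)))
    // Dense factor matrices B (J x R) and C (K x R), broadcast to all executors.
    val bB = sc.broadcast(Array(Array(1.0, 0.5), Array(2.0, 1.0)))
    val bC = sc.broadcast(Array(Array(1.0, 1.0), Array(0.5, 2.0)))
    mttkrpMode1(nnz, bB, bC, rank).collect().foreach { case (i, row) =>
      println(s"M($i, :) = ${row.mkString(", ")}")
    }
    spark.stop()
  }
}

Broadcasting the factor matrices and reducing partial rows by mode-1 index mirrors the general map/reduce shape of an MTTKRP on Spark; the thesis's variants additionally apply grid partitioning, CSR traversal, and a multi-way dynamic partitioning scheme to tune the number of reducers.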

Files

Original bundle
Name: thesis.pdf
Size: 1.08 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission