Parafac-spark: parallel tensor decompositions on spark

buir.advisorAykanat, Cevdet
dc.contributor.authorBekçe, Selim Eren
dc.date.accessioned2019-08-19T06:46:54Z
dc.date.available2019-08-19T06:46:54Z
dc.date.copyright2019-08
dc.date.issued2019-08
dc.date.submitted2019-08-09
dc.departmentDepartment of Computer Engineeringen_US
dc.descriptionCataloged from PDF version of article.en_US
dc.descriptionThesis (M.S.): İhsan Doğramacı Bilkent University, Department of Computer Engineering, 2019.en_US
dc.descriptionIncludes bibliographical references (leaves 61-62).en_US
dc.description.abstractTensors are higher-order generalizations of matrices, widely used in many data science applications and scientific disciplines. The Canonical Polyadic Decomposition (also known as CPD/PARAFAC) is a widely adopted tensor factorization for discovering and extracting latent features of tensors, usually computed via the alternating least squares (ALS) method. Developing efficient parallelizations of PARAFAC on commodity clusters is important because, as common tensor sizes reach billions of nonzeros, a naive implementation would require infeasibly large intermediate memory. Implementations of PARAFAC-ALS on shared- and distributed-memory systems are available, but they require expensive cluster setups, are low level, are not compatible with modern tooling, and are not fault tolerant by design. Many companies and data science communities prefer Apache Spark, a modern distributed computing framework with in-memory caching, and the Hadoop ecosystem of tools for their ease of use, compatibility, ability to run on commodity hardware, and fault tolerance. We developed PARAFAC-SPARK, an efficient, parallel, open-source implementation of PARAFAC on Spark, written in Scala. It can decompose 3D tensors stored in the common coordinate (COO) format in parallel with a low memory footprint by partitioning them as grids and utilizing the compressed sparse rows (CSR) format for efficient traversals. We followed and combined many of the algorithmic and methodological improvements of its predecessor implementations on Hadoop and distributed memory, and adapted them for Spark. During the MTTKRP kernel operation, by applying a multi-way dynamic partitioning scheme, we were also able to increase the number of reducers to be on par with the number of cores, achieving better utilization and a reduced memory footprint. We ran PARAFAC-SPARK on some real-world tensors and evaluated the effectiveness of each improvement as a series of variants compared with each other, as well as on some synthetically generated tensors with up to billions of rows to measure its scalability. Our fastest variant (PS-CSRSX) is up to 67% faster than our baseline Spark implementation (PS-COO) and up to 10 times faster than state-of-the-art Hadoop implementations.en_US
dc.description.degreeM.S.en_US
dc.description.statementofresponsibilityby Selim Eren Bekçeen_US
dc.embargo.release2020-02-08en_US
dc.format.extentxiii, 62 leaves : charts ; 30 cm.en_US
dc.identifier.itemidB14403
dc.identifier.urihttp://hdl.handle.net/11693/52346
dc.language.isoEnglishen_US
dc.publisherBilkent Universityen_US
dc.rightsinfo:eu-repo/semantics/openAccessen_US
dc.subjectSparken_US
dc.subjectTensoren_US
dc.subjectFactorizationen_US
dc.subjectDecompositionen_US
dc.subjectParafacen_US
dc.subjectCPDen_US
dc.subjectALSen_US
dc.subjectAlternating least squaresen_US
dc.subjectPARAFAC-ALSen_US
dc.subjectCPD-ALSen_US
dc.subjectHadoopen_US
dc.subjectGriden_US
dc.subjectPartitioningen_US
dc.subjectData scienceen_US
dc.subjectBig dataen_US
dc.titleParafac-spark: parallel tensor decompositions on sparken_US
dc.title.alternativeParafac-spark: spark ile paralel tensör ayrışımlarıen_US
dc.typeThesisen_US
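
As context for the MTTKRP kernel and COO storage mentioned in the abstract, below is a minimal, hypothetical Scala/Spark sketch of a mode-1 MTTKRP over a COO-stored 3D tensor with broadcast dense factor matrices. It is an illustrative sketch only: the names (Coo, mttkrpMode1, MttkrpSketch) are invented here, and it does not reproduce the thesis's PARAFAC-SPARK code, its grid partitioning, or its CSR-based traversal.

// Hypothetical sketch: mode-1 MTTKRP on Spark over a COO-stored sparse 3D tensor.
// Not the thesis's PARAFAC-SPARK implementation; names and structure are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
import org.apache.spark.broadcast.Broadcast

// A single nonzero of a 3D tensor in coordinate (COO) format.
case class Coo(i: Int, j: Int, k: Int, v: Double)

object MttkrpSketch {
  // Computes M(i, :) = sum over nonzeros (i, j, k, v) of v * (B(j, :) * C(k, :)),
  // where * is the elementwise (Hadamard) product of the factor-matrix rows.
  def mttkrpMode1(nnz: RDD[Coo],
                  bB: Broadcast[Array[Array[Double]]],
                  bC: Broadcast[Array[Array[Double]]],
                  rank: Int): RDD[(Int, Array[Double])] = {
    nnz.map { e =>
        val row  = new Array[Double](rank)
        val bRow = bB.value(e.j)
        val cRow = bC.value(e.k)
        var r = 0
        while (r < rank) { row(r) = e.v * bRow(r) * cRow(r); r += 1 }
        (e.i, row)
      }
      // Sum partial rows per mode-1 index; the number of reduce tasks is a tuning knob.
      .reduceByKey((x, y) => x.zip(y).map { case (a, b) => a + b })
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mttkrp-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val rank = 2
    // Toy 2 x 2 x 2 tensor with three nonzeros.
    val nnz = sc.parallelize(Seq(Coo(0, 0, 0, 1.0), Coo(0, 1, 1, 2.0), Coo(1, 0, 1, 3.0)))
    // Dense factor matrices B (J x R) and C (K x R), broadcast to all executors.
    val bB = sc.broadcast(Array(Array(1.0, 0.5), Array(2.0, 1.0)))
    val bC = sc.broadcast(Array(Array(1.0, 1.0), Array(0.5, 2.0)))
    mttkrpMode1(nnz, bB, bC, rank).collect().foreach { case (i, row) =>
      println(s"M($i, :) = ${row.mkString(", ")}")
    }
    spark.stop()
  }
}

Broadcasting the factor matrices and reducing partial rows by mode-1 index mirrors the general map/reduce shape of an MTTKRP on Spark; the thesis's variants additionally apply grid partitioning, CSR traversal, and a multi-way dynamic partitioning scheme to tune the number of reducers.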

Files

Original bundle
Name: thesis.pdf
Size: 1.08 MB
Format: Adobe Portable Document Format
Description: Full printable version

License bundle
Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission