Link to original content: https://api.crossref.org/works/10.1002/CPE.3621

{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2023,9,10]],"date-time":"2023-09-10T04:54:42Z","timestamp":1694321682687},"reference-count":28,"publisher":"Wiley","issue":"2","license":[{"start":{"date-parts":[[2015,8,28]],"date-time":"2015-08-28T00:00:00Z","timestamp":1440720000000},"content-version":"vor","delay-in-days":0,"URL":"http:\/\/onlinelibrary.wiley.com\/termsAndConditions#vor"}],"funder":[{"DOI":"10.13039\/501100000396","name":"UK Technology Strategy Board","doi-asserted-by":"crossref","id":[{"id":"10.13039\/501100000396","id-type":"DOI","asserted-by":"crossref"}]},{"name":"Rolls-Royce Plc.","award":["EP\/I006079\/1, EP\/I00677X\/1","T\u00c1MOP-4.2.1.\/B-11\/2\/KMR-2011-002","T\u00c1MOP - 4.2.2.\/B-10\/1-2010-0014"]}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Concurrency and Computation"],"published-print":{"date-parts":[[2016,2]]},"abstract":"<jats:title>Summary<\/jats:title><jats:p>Achieving optimal performance on the latest multi\u2010core and many\u2010core architectures increasingly depends on making efficient use of the hardware's vector units. This paper presents results on achieving high performance through vectorization on CPUs and the Xeon\u2010Phi on a key class of irregular applications: unstructured mesh computations. Using single instruction multiple thread (SIMT) and single instruction multiple data (SIMD) programming models, we show how unstructured mesh computations map to OpenCL or vector intrinsics through the use of code generation techniques in the OP2 Domain Specific Library and explore how irregular memory accesses and race conditions can be organized on different hardware. We benchmark Intel Xeon CPUs and the Xeon\u2010Phi, using a tsunami simulation and a representative CFD benchmark. Results are compared with previous work on CPUs and NVIDIA GPUs to provide a comparison of achievable performance on current many\u2010core systems. 
We show that auto\u2010vectorization and the OpenCL SIMT model do not map efficiently to CPU vector units because of vectorization issues and threading overheads. In contrast, using SIMD vector intrinsics imposes some restrictions and requires more involved programming techniques but results in efficient code and near\u2010optimal performance, two times faster than non\u2010vectorized code. We observe that the Xeon\u2010Phi does not provide good performance for these applications but is still comparable with a pair of mid\u2010range Xeon chips. Copyright \u00a9 2015 John Wiley & Sons, Ltd.<\/jats:p>","DOI":"10.1002\/cpe.3621","type":"journal-article","created":{"date-parts":[[2015,8,28]],"date-time":"2015-08-28T22:39:17Z","timestamp":1440801557000},"page":"557-577","source":"Crossref","is-referenced-by-count":10,"title":["Vectorizing unstructured mesh computations for many\u2010core architectures"],"prefix":"10.1002","volume":"28","author":[{"given":"I Z.","family":"Reguly","sequence":"first","affiliation":[{"name":"Oxford e\u2010Research Centre University of Oxford Oxford UK"},{"name":"Faculty of Information Technology and Bionics P\u00e1zm\u00e1ny P\u00e9ter Catholic University Budapest Hungary"}]},{"given":"Endre","family":"L\u00e1szl\u00f3","sequence":"additional","affiliation":[{"name":"Oxford e\u2010Research Centre University of Oxford Oxford UK"},{"name":"Faculty of Information Technology and Bionics P\u00e1zm\u00e1ny P\u00e9ter Catholic University Budapest Hungary"}]},{"given":"Gihan R.","family":"Mudalige","sequence":"additional","affiliation":[{"name":"Oxford e\u2010Research Centre University of Oxford Oxford UK"}]},{"given":"Mike B.","family":"Giles","sequence":"additional","affiliation":[{"name":"Oxford e\u2010Research Centre University of Oxford Oxford UK"}]}],"member":"311","published-online":{"date-parts":[[2015,8,28]]},"reference":[{"key":"e_1_2_9_2_1","unstructured":"Top500 systems 2013. 
(Available from:http:\/\/www.top500.org) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_3_1","doi-asserted-by":"crossref","unstructured":"DallyB.Power programmability and granularity: the challenges of exascale computing.2011 IEEE International Parallel Distributed Processing Symposium (IPDPS) Anchorage Alaska USA 2011;878\u2013878.","DOI":"10.1109\/IPDPS.2011.420"},{"key":"e_1_2_9_4_1","unstructured":"What is GPU Computing 2013. (Available from:http:\/\/www.nvidia.com\/object\/what-is-gpu-computing.html) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_5_1","unstructured":"SkaugenK.Petascale to exascale: extending Intel's HPC commitment 2011. ISC 2010 keynote. (Available from:http:\/\/download.intel.com\/pressroom\/archive\/reference\/ISC_2010_Skaugen_keynote.pdf)."},{"key":"e_1_2_9_6_1","unstructured":"Texas instruments multi\u2010core TMS320C66x processor. (Available from:http:\/\/www.ti.com\/c66multicore) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_7_1","doi-asserted-by":"publisher","DOI":"10.1109\/MM.2011.17"},{"key":"e_1_2_9_8_1","unstructured":"Intel Math Kernel Library 2013. 
(Available from:http:\/\/software.intel.com\/en-us\/intel-mkl) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_9_1","doi-asserted-by":"crossref","unstructured":"HeineckeA VaidyanathanK SmelyanskiyM KobotovA DubtsovR HenryG ShetAG ChrysosG DubeyP.Design and implementation of the LINPACK benchmark for single and multi\u2010node systems based on Intel Xeon Phi coprocessor.2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS) Boston Massachusetts USA 2013;126\u2013137.","DOI":"10.1109\/IPDPS.2013.113"},{"key":"e_1_2_9_10_1","unstructured":"RosinskiJ.Porting validating and optimizing NOAA weather models NIM and FIM to Intel Xeon Phi.Technical Report NOAA Silver Spring Maryland USA 2013."},{"key":"e_1_2_9_11_1","unstructured":"BrookRG HadriB BetroVC HulguinRC BrabyR.Early application experiences with the Intel MIC architecture in a Cray CX1.Cray User Group (CUG '12) Stuttgart Germany 2012; No. 194."},{"key":"e_1_2_9_12_1","unstructured":"VladimirovA KarpusenkoV.Test\u2010driving Intel Xeon Phi coprocessors with a basic N\u2010body simulation.Technical Report Colfax International Sunnyvale California USA 2013. 
(Available from:http:\/\/research.colfaxinternational.com\/post\/2013\/01\/07\/Nbody-Xeon-Phi.aspx) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_13_1","doi-asserted-by":"crossref","unstructured":"PennycookSJ HughesCJ SmelyanskiyM JarvisSA.Exploring SIMD for molecular dynamics using Intel Xeon processors and Intel Xeon Phi coprocessors.2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS) Boston Massachusetts USA 2013;1085\u20131097.","DOI":"10.1109\/IPDPS.2013.44"},{"key":"e_1_2_9_14_1","doi-asserted-by":"crossref","unstructured":"SmelyanskiyM SewallJ KalamkarD SatishN DubeyP AstafievN BurylovI NikolaevA MaidanovS LiS et al.Analysis and optimization of financial analytics benchmark on modern multi\u2010 and many\u2010core IA\u2010based architectures.2012 SC Companion High Performance Computing Networking Storage and Analysis (SCC) Salt Lake City Utah USA 2012;1154\u20131162.","DOI":"10.1109\/SC.Companion.2012.139"},{"key":"e_1_2_9_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/2370036.2145824"},{"key":"e_1_2_9_16_1","doi-asserted-by":"publisher","DOI":"10.1093\/comjnl\/bxr062"},{"issue":"4","key":"e_1_2_9_17_1","first-page":"434","article-title":"Using automatic differentiation for adjoint CFD code development","volume":"16","author":"Giles MB","year":"2008","journal-title":"Computational Fluid Dynamics Journal"},{"key":"e_1_2_9_18_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.euromechflu.2011.05.005"},{"key":"e_1_2_9_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.parco.2013.09.004"},{"key":"e_1_2_9_20_1","doi-asserted-by":"crossref","unstructured":"RabenseifnerR HagerG JostG.Hybrid MPI\/OpenMP parallel programming on clusters of multi\u2010core SMP nodes.2009 17th Euromicro International Conference on Parallel Distributed and Network\u2010based Processing Weimar Germany 
2009;427\u2013436.","DOI":"10.1109\/PDP.2009.43"},{"key":"e_1_2_9_21_1","doi-asserted-by":"publisher","DOI":"10.1109\/99.660313"},{"key":"e_1_2_9_22_1","unstructured":"Scotch and PT\u2010Scotch 2013. (Available from:http:\/\/www.labri.fr\/perso\/pelegrin\/scotch\/) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_23_1","doi-asserted-by":"publisher","DOI":"10.1137\/0724090"},{"key":"e_1_2_9_24_1","unstructured":"Intel SDK for OpenCL applications Intel 2013. (Available from:http:\/\/software.intel.com\/en-us\/vcsource\/tools\/opencl-sdk) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_25_1","unstructured":"Intel C++ Compiler XE 13.1 user and reference guide: Intel 2013. (Available from:https:\/\/software.intel.com\/sites\/products\/documentation\/doclib\/iss\/2013\/compiler\/cpp-lin\/index.htm) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_26_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.jpdc.2012.07.008"},{"key":"e_1_2_9_27_1","unstructured":"NVIDIA Tesla Kepler GPU accelerators 2012. (Available from:http:\/\/www.nvidia.com\/object\/tesla-servers.html) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_28_1","unstructured":"LindbergP.Basic OpenMP threading overhead.Technical Report Intel Santa Clara California USA 2009. (Available from:http:\/\/software.intel.com\/en-us\/articles\/basic-openmp-threading-overhead) [Accessed on 3 August 2015]."},{"key":"e_1_2_9_29_1","unstructured":"OP2 github repository 2013. 
(Available from:https:\/\/github.com\/OP2\/OP2-Common) [Accessed on 3 August 2015]."}],"container-title":["Concurrency and Computation: Practice and Experience"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/api.wiley.com\/onlinelibrary\/tdm\/v1\/articles\/10.1002%2Fcpe.3621","content-type":"unspecified","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/pdf\/10.1002\/cpe.3621","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2023,9,2]],"date-time":"2023-09-02T13:25:21Z","timestamp":1693661121000},"score":1,"resource":{"primary":{"URL":"https:\/\/onlinelibrary.wiley.com\/doi\/10.1002\/cpe.3621"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2015,8,28]]},"references-count":28,"journal-issue":{"issue":"2","published-print":{"date-parts":[[2016,2]]}},"alternative-id":["10.1002\/cpe.3621"],"URL":"http:\/\/dx.doi.org\/10.1002\/cpe.3621","archive":["Portico"],"relation":{},"ISSN":["1532-0626","1532-0634"],"issn-type":[{"value":"1532-0626","type":"print"},{"value":"1532-0634","type":"electronic"}],"subject":[],"published":{"date-parts":[[2015,8,28]]}}}