SlideShare a Scribd company logo
1 of 16
Download to read offline
Linaro
Connect,
Hong Kong
March 2013
LCE13, Dublin. 8-12 July 2013
Porting - ARMV8 Neon
connect.linaro.org

ARMV8 Neon

Vector operations on ARMV8

Scalar operation on ARMV8

Neon load/store instructions

Load/Store illustration

Widening/Narrow Arithemetic

Vector Reduction

ARMV7 vs ARMV8 Neon Instructions

Armv8 Porting (Jpeg Turbo IDCT code)
Contents
connect.linaro.org
ARMV8 Neon
 Neon is an optional extension to ARMV8
 Neon performs packed SIMD operations on either integer or floating
point (Single/double precision) data.
 Neon extension registers are distinct from ARM core register set and
contains
l
32 X 128 bit wide vector registers
l
32 X 64 bit wide vector registers which are held in lower 64 bit of
each 128 bit registers
connect.linaro.org
Vector Operations (Data type - byte)
– Char array1[8] = {1,2,3,4,5,6,7,8}
– Char array2[8] = {1,2,3,4,5,6,7,8}
– /* Load array1 V1.8B*/
– LD1 {V1.8B},[array1]
– /* Load array2 V2.8B*/
– LD1 {V2.8B},[array2]
– /* operations */
– MUL V1.8B,V1.8B,V2.8B
8 7 6 5 4 3 2 1
8 7 6 5 4 3 2 1 V2.8B
V1.8B
64 49 36 25 16 9 4 1 V1.8B
64 49 36 25 16 9 4 10 0 0 0 0 0 0 0 V1.16B
Vector Operations on ARMV8
connect.linaro.org
 Scalar Operations
l
Some SIMD instruction operate on scalar data and registers
holding these data behaves similar to main general purpose
registers.
l
Suffix n ->{registers from 0 to 31}
l
Unused higher bits are not accessed during read and set to zero
during write.
Scalar Operation on Neon
connect.linaro.org
 LD1,LD2,LD3,LD4 and ST1,ST2,ST3 & ST4 are some of the vector load and
store instructions
 The format of this instruction is -> it can contain 1 ~ 4 consecutive registers
to/from which consecutive 8,16,32,64 bit elements can be transferred.
 LD1 transfers data from memory to consecutive registers without any interleave
l
Load/Store Instruction Neon
connect.linaro.org
 LD2/St2 instruction loads elements to the specified registers/memory in
an interleaved fashion.
 The format of the instruction
ld2 {vn.<t>, vm.<t>}, [x1] t-> 8b,16b, 4h,8h,2s,4s,d,2d
for this particular instruction the rules are the registers should be consecutive
and the number of registers has to be 2 or 4.
l
0x10
0x11
0x10
0x11
0x10
0x11
0x100x10
0x110x11
ld2 {v0.2s,v1.2s}, [x0]
Load/Store Instruction Neon
connect.linaro.org
LD3/St3 instruction loads elements to the specified registers / memory with
2 data element interleaved .
The format of the instruction
ld3 {vi.<t>, vj.<t>,vk}, [x1] t-> 8b,16b, 4h,8h,2s,4s,d,2d
For this particular instruction the rules are registers should be consecutive
and the number of registers has to be 3.
optional space is allowed to support 128bit vector
Load/Store Instruction Neon
connect.linaro.org
 LD4/St4 instruction loads elements to the specified registers/memory with
3 data element interleaved .
 The format of the instruction-> ld3 {vi.<t>, vj.<t>,vk}, [x1] t-> 8b,16b,
4h,8h,2s,4s,d,2d
Instruction rules:
1. the registers should be consecutive and the number of registers has to be
4.
2. optional space is allowed to support 128bit vector
l
Load/Store Instruction Neon
0x10
0x11
0x12
0x13 0x100x10
0x110x11
v0.2s
0x12 0x12
v1.2s
v2.2s
0x10
0x11
0x12
0x13
0x13 0x13 v3.2s
ld4 {v0.2s,v1.2s,v2.2s,v3.2s}, [x0]
connect.linaro.org
Example RGB channel :
Example: Load 12 16 bit values from the address stored in R0, and split them over
64 bit registers V0.8B, V1.8B and V2.8B.
LD3 {v0.8B,V1.16B,V2.16B}, [X0]
l
Load/Store Instruction Neon
connect.linaro.org
Narrow,Long and Wide Instructions support
1. Long Instructions
SADDL Vd.<Td>, Vn.<Ts>, Vm.<Ts>
<Td>/<Ts> is 8H/8B, 4S/4H or 2D/2S.
2. Wide Instruction
SADDW Vd.<Td>, Vn.<Td>, Vm.<Ts>
<Td>/<Ts> is 8H/8B, 4S/4H or 2D/2S
3. Narrow Instruction -> add and narrow higher half
ADDHN Vd.<Td>, Vn.<Ts>, Vm.<Ts>
Narrow/Wide Neon Instructions
connect.linaro.org
Vector Reduce
1. addv <v>d, vn<T> (eg:- addv v15.[5]b, v5.8b )
<v>d is the destination scalar register (b,h,s,d)
Vn<T> is vector register of size T ( 8b/16b, 4h/8h, 2s/4s,d/2d)
2. saddlv <v>d, vn<T> (eg:- addv v15.[5]h, v5.8b )
<v>d is the destination scalar register (h,s,d)
Vn<T> is vector register of size T ( 8b/16b, 4h/8h, 2s/4s,d/2d)
l
Vector Reduction
connect.linaro.org
Difference in the mnemonics
ARMV7 ARMV8
1. vld1.16 {d1,d2,d3,d4}, [r0]
2. vsub.s32 q1,q1,q2
1. ld1 {v2.4h,v3.4h,v4.4h}, [x0]
2. ssub v1.4s,v1.4s,v1.4s
1. ARMV7 mnemonic prefix ‘v’ has been removed and replaced by s/u/f/p
which indicates the data type as signed/unsigned/float/polynomial
2. Vector organization ( element Size/ number of lanes) is actually described by
the register qualifiers and not by the mnemonics as in ARMV7.
3. suffix like V has been added to the mnemonic to indicate it’s a reduction type
operation, suffix 2 indicates the instruction is a widening/narrowing type
operating on the second half of the register.
Scalar Operation on Neon
connect.linaro.org
Armv8 Porting (Jpeg Turbo IDCT code)
1. IDCT ( integer & Fast algorithms) ARMV7 neon code availability in Jpeg Turbo.
2. Some of the instruction of ARMV7 doesnot exist in ARMV8 like vswap, push, pull
and has partial load pair instrucitons.
3. IDCT ARMV7 code takes advantage of complete register overlap of 64bit & 128bit
registers and this can increase the effort of porting.
Other methods of porting if you dont have an ARMv7 code
1. use of Intrinisics can be a good option but bit risky as it doesnot guarntee the
effective neon code generation as intended
2. can seperate the function of interest and generate armv8 neon code through
vectorizaiton option and start optimizing that.
l
Linaro
Connect,
Hong Kong
March 2013
Questions?
More about Linaro: http://www.linaro.org/about/
More about Linaro engineering: http://www.linaro.org/engineering/
How to join: http://www.linaro.org/about/how-to-join
Linaro members: www.linaro.org/members

More Related Content

Viewers also liked

LAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELinaro
 
SFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEESFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEELinaro
 
Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)Yannick Gicquel
 
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEEBKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEELinaro
 
LCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEELCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEELinaro
 
HKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewHKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewLinaro
 
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3Linaro
 
Arm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefingArm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefingMerck Hung
 
BUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 KernelBUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 KernelLinaro
 
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMXPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMThe Linux Foundation
 

Viewers also liked (10)

LAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEELAS16-504: Secure Storage updates in OP-TEE
LAS16-504: Secure Storage updates in OP-TEE
 
SFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEESFO15-503: Secure storage in OP-TEE
SFO15-503: Secure storage in OP-TEE
 
Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)Introduction to Optee (26 may 2016)
Introduction to Optee (26 may 2016)
 
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEEBKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
BKK16-110 A Gentle Introduction to Trusted Execution and OP-TEE
 
LCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEELCU14-103: How to create and run Trusted Applications on OP-TEE
LCU14-103: How to create and run Trusted Applications on OP-TEE
 
HKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting ReviewHKG15-311: OP-TEE for Beginners and Porting Review
HKG15-311: OP-TEE for Beginners and Porting Review
 
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
LAS16-111: Easing Access to ARM TrustZone – OP-TEE and Raspberry Pi 3
 
Arm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefingArm v8 instruction overview android 64 bit briefing
Arm v8 instruction overview android 64 bit briefing
 
BUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 KernelBUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
BUD17-DF15 - Optimized Android N MR1 + 4.9 Kernel
 
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARMXPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
XPDS16: Porting Xen on ARM to a new SOC - Julien Grall, ARM
 

More from Linaro

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloLinaro
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaLinaro
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraLinaro
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaLinaro
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018Linaro
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018Linaro
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...Linaro
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Linaro
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Linaro
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Linaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteLinaro
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopLinaro
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineLinaro
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allLinaro
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorLinaro
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMULinaro
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MLinaro
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation Linaro
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootLinaro
 

More from Linaro (20)

Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea GalloDeep Learning Neural Network Acceleration at the Edge - Andrea Gallo
Deep Learning Neural Network Acceleration at the Edge - Andrea Gallo
 
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta VekariaArm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
Arm Architecture HPC Workshop Santa Clara 2018 - Kanta Vekaria
 
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua MoraHuawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
Huawei’s requirements for the ARM based HPC solution readiness - Joshua Mora
 
Bud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qaBud17 113: distribution ci using qemu and open qa
Bud17 113: distribution ci using qemu and open qa
 
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
OpenHPC Automation with Ansible - Renato Golin - Linaro Arm HPC Workshop 2018
 
HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018HPC network stack on ARM - Linaro HPC Workshop 2018
HPC network stack on ARM - Linaro HPC Workshop 2018
 
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
It just keeps getting better - SUSE enablement for Arm - Linaro HPC Workshop ...
 
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
Intelligent Interconnect Architecture to Enable Next Generation HPC - Linaro ...
 
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
Yutaka Ishikawa - Post-K and Arm HPC Ecosystem - Linaro Arm HPC Workshop Sant...
 
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
Andrew J Younge - Vanguard Astra - Petascale Arm Platform for U.S. DOE/ASC Su...
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening KeynoteHKG18-100K1 - George Grey: Opening Keynote
HKG18-100K1 - George Grey: Opening Keynote
 
HKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP WorkshopHKG18-318 - OpenAMP Workshop
HKG18-318 - OpenAMP Workshop
 
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainlineHKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
HKG18-501 - EAS on Common Kernel 4.14 and getting (much) closer to mainline
 
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and allHKG18-315 - Why the ecosystem is a wonderful thing, warts and all
HKG18-315 - Why the ecosystem is a wonderful thing, warts and all
 
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse HypervisorHKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
HKG18- 115 - Partitioning ARM Systems with the Jailhouse Hypervisor
 
HKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMUHKG18-TR08 - Upstreaming SVE in QEMU
HKG18-TR08 - Upstreaming SVE in QEMU
 
HKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8MHKG18-113- Secure Data Path work with i.MX8M
HKG18-113- Secure Data Path work with i.MX8M
 
HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation HKG18-120 - Devicetree Schema Documentation and Validation
HKG18-120 - Devicetree Schema Documentation and Validation
 
HKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted bootHKG18-223 - Trusted FirmwareM: Trusted boot
HKG18-223 - Trusted FirmwareM: Trusted boot
 

Recently uploaded

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 

Recently uploaded (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 

LCE13: Porting NEON from armv7 to armv8, an experience report

  • 1. Linaro Connect, Hong Kong March 2013 LCE13, Dublin. 8-12 July 2013 Porting - ARMV8 Neon
  • 2. connect.linaro.org  ARMV8 Neon  Vector operations on ARMV8  Scalar operation on ARMV8  Neon load/store instructions  Load/Store illustration  Widening/Narrow Arithemetic  Vector Reduction  ARMV7 vs ARMV8 Neon Instructions  Armv8 Porting (Jpeg Turbo IDCT code) Contents
  • 3. connect.linaro.org ARMV8 Neon  Neon is an optional extension to ARMV8  Neon performs packed SIMD operations on either integer or floating point (Single/double precision) data.  Neon extension registers are distinct from ARM core register set and contains l 32 X 128 bit wide vector registers l 32 X 64 bit wide vector registers which are held in lower 64 bit of each 128 bit registers
  • 4. connect.linaro.org Vector Operations (Data type - byte) – Char array1[8] = {1,2,3,4,5,6,7,8} – Char array2[8] = {1,2,3,4,5,6,7,8} – /* Load array1 V1.8B*/ – LD1 {V1.8B},[array1] – /* Load array2 V2.8B*/ – LD1 {V2.8B},[array2] – /* operations */ – MUL V1.8B,V1.8B,V2.8B 8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 V2.8B V1.8B 64 49 36 25 16 9 4 1 V1.8B 64 49 36 25 16 9 4 10 0 0 0 0 0 0 0 V1.16B Vector Operations on ARMV8
  • 5. connect.linaro.org  Scalar Operations l Some SIMD instruction operate on scalar data and registers holding these data behaves similar to main general purpose registers. l Suffix n ->{registers from 0 to 31} l Unused higher bits are not accessed during read and set to zero during write. Scalar Operation on Neon
  • 6. connect.linaro.org  LD1,LD2,LD3,LD4 and ST1,ST2,ST3 & ST4 are some of the vector load and store instructions  The format of this instruction is -> it can contain 1 ~ 4 consecutive registers to/from which consecutive 8,16,32,64 bit elements can be transferred.  LD1 transfers data from memory to consecutive registers without any interleave l Load/Store Instruction Neon
  • 7. connect.linaro.org  LD2/St2 instruction loads elements to the specified registers/memory in an interleaved fashion.  The format of the instruction ld2 {vn.<t>, vm.<t>}, [x1] t-> 8b,16b, 4h,8h,2s,4s,d,2d for this particular instruction the rules are the registers should be consecutive and the number of registers has to be 2 or 4. l 0x10 0x11 0x10 0x11 0x10 0x11 0x100x10 0x110x11 ld2 {v0.2s,v1.2s}, [x0] Load/Store Instruction Neon
  • 8. connect.linaro.org LD3/St3 instruction loads elements to the specified registers / memory with 2 data element interleaved . The format of the instruction ld3 {vi.<t>, vj.<t>,vk}, [x1] t-> 8b,16b, 4h,8h,2s,4s,d,2d For this particular instruction the rules are registers should be consecutive and the number of registers has to be 3. optional space is allowed to support 128bit vector Load/Store Instruction Neon
  • 9. connect.linaro.org  LD4/St4 instruction loads elements to the specified registers/memory with 3 data element interleaved .  The format of the instruction-> ld3 {vi.<t>, vj.<t>,vk}, [x1] t-> 8b,16b, 4h,8h,2s,4s,d,2d Instruction rules: 1. the registers should be consecutive and the number of registers has to be 4. 2. optional space is allowed to support 128bit vector l Load/Store Instruction Neon 0x10 0x11 0x12 0x13 0x100x10 0x110x11 v0.2s 0x12 0x12 v1.2s v2.2s 0x10 0x11 0x12 0x13 0x13 0x13 v3.2s ld4 {v0.2s,v1.2s,v2.2s,v3.2s}, [x0]
  • 10. connect.linaro.org Example RGB channel : Example: Load 12 16 bit values from the address stored in R0, and split them over 64 bit registers V0.8B, V1.8B and V2.8B. LD3 {v0.8B,V1.16B,V2.16B}, [X0] l Load/Store Instruction Neon
  • 11. connect.linaro.org Narrow,Long and Wide Instructions support 1. Long Instructions SADDL Vd.<Td>, Vn.<Ts>, Vm.<Ts> <Td>/<Ts> is 8H/8B, 4S/4H or 2D/2S. 2. Wide Instruction SADDW Vd.<Td>, Vn.<Td>, Vm.<Ts> <Td>/<Ts> is 8H/8B, 4S/4H or 2D/2S 3. Narrow Instruction -> add and narrow higher half ADDHN Vd.<Td>, Vn.<Ts>, Vm.<Ts> Narrow/Wide Neon Instructions
  • 12. connect.linaro.org Vector Reduce 1. addv <v>d, vn<T> (eg:- addv v15.[5]b, v5.8b ) <v>d is the destination scalar register (b,h,s,d) Vn<T> is vector register of size T ( 8b/16b, 4h/8h, 2s/4s,d/2d) 2. saddlv <v>d, vn<T> (eg:- addv v15.[5]h, v5.8b ) <v>d is the destination scalar register (h,s,d) Vn<T> is vector register of size T ( 8b/16b, 4h/8h, 2s/4s,d/2d) l Vector Reduction
  • 13. connect.linaro.org Difference in the mnemonics ARMV7 ARMV8 1. vld1.16 {d1,d2,d3,d4}, [r0] 2. vsub.s32 q1,q1,q2 1. ld1 {v2.4h,v3.4h,v4.4h}, [x0] 2. ssub v1.4s,v1.4s,v1.4s 1. ARMV7 mnemonic prefix ‘v’ has been removed and replaced by s/u/f/p which indicates the data type as signed/unsigned/float/polynomial 2. Vector organization ( element Size/ number of lanes) is actually described by the register qualifiers and not by the mnemonics as in ARMV7. 3. suffix like V has been added to the mnemonic to indicate it’s a reduction type operation, suffix 2 indicates the instruction is a widening/narrowing type operating on the second half of the register. Scalar Operation on Neon
  • 14. connect.linaro.org Armv8 Porting (Jpeg Turbo IDCT code) 1. IDCT ( integer & Fast algorithms) ARMV7 neon code availability in Jpeg Turbo. 2. Some of the instruction of ARMV7 doesnot exist in ARMV8 like vswap, push, pull and has partial load pair instrucitons. 3. IDCT ARMV7 code takes advantage of complete register overlap of 64bit & 128bit registers and this can increase the effort of porting. Other methods of porting if you dont have an ARMv7 code 1. use of Intrinisics can be a good option but bit risky as it doesnot guarntee the effective neon code generation as intended 2. can seperate the function of interest and generate armv8 neon code through vectorizaiton option and start optimizing that. l
  • 16. More about Linaro: http://www.linaro.org/about/ More about Linaro engineering: http://www.linaro.org/engineering/ How to join: http://www.linaro.org/about/how-to-join Linaro members: www.linaro.org/members