This article implements CPU data cache operations in arm64 assembly language. Data cache operations should normally go through the Linux Kernel API, but unfortunately no suitable API was available (more on this later). Please note that this is only an experimental article describing a quick trial.
When exchanging data between the PS (Processing System) and the PL (Programmable Logic) of a ZynqMP (ARM64), one approach is to place memory such as BRAM on the PL side and access it from the CPU in the PS.
This article describes how to access memory on the PL side from Linux so that the following conditions are met:
If you just want plain access, uio is enough. However, uio cannot enable the CPU data cache (condition 1), which is a disadvantage in performance when transferring large amounts of data.
Also, with the method using /dev/mem and reserved-memory shown in the reference ["Accessing BRAM In Linux"], the data cache can be enabled, but the cache cannot be operated on manually, so it is not suitable for exchanging data with the PL side. In addition, reserved-memory can only be specified when Linux boots, so the memory cannot be freely attached or detached after boot.
The author publishes [udmabuf] as open source.
I added a function to udmabuf on a trial basis so that memory on the PL side can be accessed from Linux (it is a quick hack, and only a trial). I then implemented a sample design that uses BRAM as the PL-side memory and confirmed the effect of the data cache.
This article describes the following:
In this chapter, we actually measure and show the effect of the data cache when accessing memory on the PL side from the PS side.
The environment used for the measurement is as follows.
Ultra96-V2
[ZynqMP-FPGA-Linux v2019.2.1]
linux-xlnx v2019.2 (Linux Kernel 4.19)
Debian10
Xilinx Vivado 2019.2
[udmabuf v2.2.0-rc2]
The following design is implemented on the PL side. 256 KBytes of memory are implemented as BRAM on the PL side, with Xilinx's AXI BRAM Controller as the interface. The operating frequency is 100 MHz. An ILA (Integrated Logic Analyzer) is connected to observe the AXI I/F of the AXI BRAM Controller and the BRAM I/F waveforms.
Fig.1 Block diagram of PLBRAM-Ultra96
These environments are published on github.
With the data cache off, writing 256 KBytes of data to the BRAM on the PL side with memcpy() took 0.496 msec. The write speed is about 528 MByte/sec.
The AXI I/F waveform at that time is shown below.
Fig.2 AXI I/F waveform of memory write when data cache is off
As you can see from the waveform, there is no burst transfer (AWLEN=00); one word (16 bytes) is transferred at a time.
With the data cache on, writing 256 KBytes of data to the BRAM on the PL side with memcpy() took 0.317 msec. The write speed is about 827 MByte/sec.
The AXI I/F waveform at that time is shown below.
Fig.3 AXI I/F waveform of memory write when data cache is on
As you can see from the waveform, each write is a burst transfer of 4 words (64 bytes) (AWLEN=03).
Writes to the BRAM do not occur at the moment the CPU writes. When the CPU writes, the data first goes into the data cache and is not yet written to the BRAM. The BRAM is written only when a data cache flush instruction is executed manually, or when the data cache becomes full and a cache line is evicted. At that point, the write is performed one cache line at a time (64 bytes on arm64).
With the data cache off, reading 256 KBytes of data from the BRAM on the PL side with memcpy() took 3.485 msec. The read speed is about 75 MByte/sec.
The AXI I/F waveform at that time is shown below.
Fig.4 AXI I/F waveform of memory read when data cache is off
As you can see from the waveform, there is no burst transfer (ARLEN=00); one word (16 bytes) is transferred at a time.
With the data cache on, reading 256 KBytes of data from the BRAM on the PL side with memcpy() took 0.409 msec. The read speed is about 641 MByte/sec.
The AXI I/F waveform at that time is shown below.
Fig.5 AXI I/F waveform of memory read when data cache is on
As you can see from the waveform, each read is a burst transfer of 4 words (64 bytes) (ARLEN=03).
When the CPU reads memory and the data is not in the data cache, it reads from the BRAM and fills a cache line; one cache line (64 bytes on arm64) is read from the BRAM at a time. After that, as long as the data is in the data cache, it is supplied to the CPU from the cache and no BRAM access occurs. This is why memory reads are faster than with the data cache off. In this environment, read performance improves dramatically from about 75 MByte/sec with the data cache off to about 641 MByte/sec with it on.
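For reference, this kind of measurement can be reproduced with a simple user-space routine like the one below. This is only a minimal sketch of the method, assuming the PL-side BRAM has already been mapped into user space (for example via udmabuf); the buffer pointers are placeholders and this is not the exact program used for the numbers above.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (256 * 1024)   /* 256 KBytes, matching the BRAM size */

/* Copy size bytes from src to dst once and return the elapsed time in seconds. */
static double measure_memcpy(void *dst, const void *src, size_t size)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

/* bram_buf is the mmap()'ed PL-side BRAM, dram_buf is an ordinary buffer. */
static void report(void *bram_buf, void *dram_buf)
{
    double sec = measure_memcpy(bram_buf, dram_buf, BUF_SIZE); /* write to BRAM  */
    printf("write: %.3f msec, %.1f MByte/sec\n", sec * 1e3, BUF_SIZE / sec / 1e6);

    sec = measure_memcpy(dram_buf, bram_buf, BUF_SIZE);        /* read from BRAM */
    printf("read : %.3f msec, %.1f MByte/sec\n", sec * 1e3, BUF_SIZE / sec / 1e6);
}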
Linux has a framework called Dynamic DMA mapping (dma-mapping). The Linux Kernel normally works with virtual addresses. A device capable of DMA, on the other hand, usually requires physical addresses. dma-mapping is the framework that bridges the Linux Kernel and DMA-capable devices: it allocates and manages DMA buffers, translates between physical and virtual addresses, and manages the data cache when necessary. For details, see DMA-API-HOWTO in the Linux Kernel documentation.
With dma-mapping, when dma_alloc_coherent() allocates a DMA buffer, it normally allocates it from memory managed by the Linux Kernel. Since the memory on the PL side is not part of the Linux Kernel's memory, some mechanism is needed to place the DMA buffer in the PL-side memory.
One way to allocate a DMA buffer in the memory on the PL side is to use the reserved-memory of the Device Tree. See below for how to use reserved-memory.
However, the reserved-memory method does not work with Device Tree Overlays: reserved-memory is configured once at Linux Kernel boot and cannot be reconfigured afterwards, and it is not covered by Device Tree Overlay.
Normally, dma_alloc_coherent() allocates the DMA buffer from memory managed by the Linux Kernel. However, a mechanism called the device coherent pool can be used to allocate a DMA buffer from memory outside the Linux Kernel. The source code for the device coherent pool is in kernel/dma/coherent.c.
dma_declare_coherent_memory()
To use the device coherent pool, first call dma_declare_coherent_memory(). dma_declare_coherent_memory() looks like this:
kernel/dma/coherent.c
int dma_declare_coherent_memory(struct device *dev, phys_addr_t phys_addr,
                                dma_addr_t device_addr, size_t size, int flags)
{
    struct dma_coherent_mem *mem;
    int ret;

    ret = dma_init_coherent_memory(phys_addr, device_addr, size, flags, &mem);
    if (ret)
        return ret;

    ret = dma_assign_coherent_memory(dev, mem);
    if (ret)
        dma_release_coherent_memory(mem);
    return ret;
}
EXPORT_SYMBOL(dma_declare_coherent_memory);
Specify the physical address of the memory you want to use in phys_addr, the address as seen from the device in device_addr, and the size of the memory in size.
The memory is initialized with dma_init_coherent_memory() and assigned to the dev device structure with dma_assign_coherent_memory().
After that, when dma_alloc_coherent() is used to allocate a DMA buffer for dev, the buffer is allocated from the memory specified by dma_declare_coherent_memory().
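To make the flow concrete, here is a minimal sketch of how a driver might declare PL-side memory as a device coherent pool and then allocate from it, written against the Linux 4.19 API quoted above. The physical address 0x400000000 and the 256 KByte size are placeholders standing in for this article's BRAM, the device address is assumed equal to the physical address, and error handling is reduced to the essentials; this is not the actual udmabuf code.

#include <linux/device.h>
#include <linux/dma-mapping.h>

#define PLBRAM_PHYS_ADDR 0x400000000ULL  /* hypothetical PL BRAM physical address */
#define PLBRAM_SIZE      (256 * 1024)    /* 256 KBytes */

static void      *plbram_cpu_addr;
static dma_addr_t plbram_dma_handle;

static int plbram_setup(struct device *dev)
{
    int ret;

    /* Register the PL-side BRAM as this device's coherent pool. */
    ret = dma_declare_coherent_memory(dev, PLBRAM_PHYS_ADDR, PLBRAM_PHYS_ADDR,
                                      PLBRAM_SIZE, DMA_MEMORY_EXCLUSIVE);
    if (ret)
        return ret;

    /* dma_alloc_coherent() now allocates from the pool declared above. */
    plbram_cpu_addr = dma_alloc_coherent(dev, PLBRAM_SIZE, &plbram_dma_handle,
                                         GFP_KERNEL);
    if (!plbram_cpu_addr) {
        dma_release_declared_memory(dev);
        return -ENOMEM;
    }
    return 0;
}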
dma_alloc_coherent()
Next, the mechanism by which dma_alloc_coherent() allocates the DMA buffer from the memory area declared with dma_declare_coherent_memory() is explained. dma_alloc_coherent() is defined in include/linux/dma-mapping.h as follows:
include/linux/dma-mapping.h
static inline void *dma_alloc_attrs(struct device *dev, size_t size,
                                    dma_addr_t *dma_handle, gfp_t flag,
                                    unsigned long attrs)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);
    void *cpu_addr;

    BUG_ON(!ops);
    WARN_ON_ONCE(dev && !dev->coherent_dma_mask);

    if (dma_alloc_from_dev_coherent(dev, size, dma_handle, &cpu_addr))
        return cpu_addr;

    /* let the implementation decide on the zone to allocate from: */
    flag &= ~(__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM);

    if (!arch_dma_alloc_attrs(&dev))
        return NULL;
    if (!ops->alloc)
        return NULL;

    cpu_addr = ops->alloc(dev, size, dma_handle, flag, attrs);
    debug_dma_alloc_coherent(dev, size, *dma_handle, cpu_addr);
    return cpu_addr;
}

(Omission)

static inline void *dma_alloc_coherent(struct device *dev, size_t size,
                                       dma_addr_t *dma_handle, gfp_t flag)
{
    return dma_alloc_attrs(dev, size, dma_handle, flag, 0);
}
dma_alloc_attrs(), which is called by dma_alloc_coherent(), calls dma_alloc_from_dev_coherent() first. dma_alloc_from_dev_coherent() is defined in kernel/dma/coherent.c as follows:
kernel/dma/coherent.c
static inline struct dma_coherent_mem *dev_get_coherent_memory(struct device *dev)
{
    if (dev && dev->dma_mem)
        return dev->dma_mem;
    return NULL;
}

(Omission)

/**
 * dma_alloc_from_dev_coherent() - allocate memory from device coherent pool
 * @dev:        device from which we allocate memory
 * @size:       size of requested memory area
 * @dma_handle: This will be filled with the correct dma handle
 * @ret:        This pointer will be filled with the virtual address
 *              to allocated area.
 *
 * This function should be only called from per-arch dma_alloc_coherent()
 * to support allocation from per-device coherent memory pools.
 *
 * Returns 0 if dma_alloc_coherent should continue with allocating from
 * generic memory areas, or !0 if dma_alloc_coherent should return @ret.
 */
int dma_alloc_from_dev_coherent(struct device *dev, ssize_t size,
                                dma_addr_t *dma_handle, void **ret)
{
    struct dma_coherent_mem *mem = dev_get_coherent_memory(dev);

    if (!mem)
        return 0;

    *ret = __dma_alloc_from_coherent(mem, size, dma_handle);
    if (*ret)
        return 1;

    /*
     * In the case where the allocation can not be satisfied from the
     * per-device area, try to fall back to generic memory if the
     * constraints allow it.
     */
    return mem->flags & DMA_MEMORY_EXCLUSIVE;
}
EXPORT_SYMBOL(dma_alloc_from_dev_coherent);
dma_alloc_from_dev_coherent() first calls dev_get_coherent_memory() to check dma_mem in the device structure. If dma_mem is NULL, it returns without doing anything; but if a device coherent pool has been assigned to dma_mem by dma_declare_coherent_memory(), __dma_alloc_from_coherent() allocates a DMA buffer from that pool.
dma_release_declared_memory()
The device coherent pool assigned to the device structure with dma_declare_coherent_memory() is released with dma_release_declared_memory().
kernel/dma/coherent.c
void dma_release_declared_memory(struct device *dev)
{
    struct dma_coherent_mem *mem = dev->dma_mem;

    if (!mem)
        return;
    dma_release_coherent_memory(mem);
    dev->dma_mem = NULL;
}
EXPORT_SYMBOL(dma_release_declared_memory);
dma_mmap_from_dev_coherent()
Use dma_mmap_from_dev_coherent() to map the allocated DMA buffer to user space. Normally dma_mmap_coherent() is used to map a DMA buffer to user space, but dma_mmap_from_dev_coherent() allows mapping with the data cache enabled.
dma_mmap_from_dev_coherent() is defined as follows. Notice that dma_mmap_from_dev_coherent() makes no changes to vma->vm_page_prot.
kernel/dma/coherent.c
static int __dma_mmap_from_coherent(struct dma_coherent_mem *mem,
                                    struct vm_area_struct *vma, void *vaddr,
                                    size_t size, int *ret)
{
    if (mem && vaddr >= mem->virt_base && vaddr + size <=
        (mem->virt_base + (mem->size << PAGE_SHIFT))) {
        unsigned long off = vma->vm_pgoff;
        int start = (vaddr - mem->virt_base) >> PAGE_SHIFT;
        int user_count = vma_pages(vma);
        int count = PAGE_ALIGN(size) >> PAGE_SHIFT;

        *ret = -ENXIO;
        if (off < count && user_count <= count - off) {
            unsigned long pfn = mem->pfn_base + start + off;
            *ret = remap_pfn_range(vma, vma->vm_start, pfn,
                                   user_count << PAGE_SHIFT,
                                   vma->vm_page_prot);
        }
        return 1;
    }
    return 0;
}

/**
 * dma_mmap_from_dev_coherent() - mmap memory from the device coherent pool
 * @dev:   device from which the memory was allocated
 * @vma:   vm_area for the userspace memory
 * @vaddr: cpu address returned by dma_alloc_from_dev_coherent
 * @size:  size of the memory buffer allocated
 * @ret:   result from remap_pfn_range()
 *
 * This checks whether the memory was allocated from the per-device
 * coherent memory pool and if so, maps that memory to the provided vma.
 *
 * Returns 1 if @vaddr belongs to the device coherent pool and the caller
 * should return @ret, or 0 if they should proceed with mapping memory from
 * generic areas.
 */
int dma_mmap_from_dev_coherent(struct device *dev, struct vm_area_struct *vma,
                               void *vaddr, size_t size, int *ret)
{
    struct dma_coherent_mem *mem = dev_get_coherent_memory(dev);

    return __dma_mmap_from_coherent(mem, vma, vaddr, size, ret);
}
EXPORT_SYMBOL(dma_mmap_from_dev_coherent);
On the other hand, dma_mmap_coherent() is what you would normally use to map the DMA buffer to user space. dma_mmap_coherent() is defined in include/linux/dma-mapping.h as follows:
include/linux/dma-mapping.h
/**
 * dma_mmap_attrs - map a coherent DMA allocation into user space
 * @dev: valid struct device pointer, or NULL for ISA and EISA-like devices
 * @vma: vm_area_struct describing requested user mapping
 * @cpu_addr: kernel CPU-view address returned from dma_alloc_attrs
 * @handle: device-view address returned from dma_alloc_attrs
 * @size: size of memory originally requested in dma_alloc_attrs
 * @attrs: attributes of mapping properties requested in dma_alloc_attrs
 *
 * Map a coherent DMA buffer previously allocated by dma_alloc_attrs
 * into user space. The coherent DMA buffer must not be freed by the
 * driver until the user space mapping has been released.
 */
static inline int
dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma, void *cpu_addr,
               dma_addr_t dma_addr, size_t size, unsigned long attrs)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);

    BUG_ON(!ops);
    if (ops->mmap)
        return ops->mmap(dev, vma, cpu_addr, dma_addr, size, attrs);
    return dma_common_mmap(dev, vma, cpu_addr, dma_addr, size);
}

#define dma_mmap_coherent(d, v, c, h, s) dma_mmap_attrs(d, v, c, h, s, 0)
dma_mmap_attrs() calls the mmap() of the architecture-dependent dma_map_ops. For arm64, mmap() looks like this:
arch/arm64/mm/dma-mapping.c
static const struct dma_map_ops arm64_swiotlb_dma_ops = {
    .alloc = __dma_alloc,
    .free = __dma_free,
    .mmap = __swiotlb_mmap,
    .get_sgtable = __swiotlb_get_sgtable,
    .map_page = __swiotlb_map_page,
    .unmap_page = __swiotlb_unmap_page,
    .map_sg = __swiotlb_map_sg_attrs,
    .unmap_sg = __swiotlb_unmap_sg_attrs,
    .sync_single_for_cpu = __swiotlb_sync_single_for_cpu,
    .sync_single_for_device = __swiotlb_sync_single_for_device,
    .sync_sg_for_cpu = __swiotlb_sync_sg_for_cpu,
    .sync_sg_for_device = __swiotlb_sync_sg_for_device,
    .dma_supported = __swiotlb_dma_supported,
    .mapping_error = __swiotlb_dma_mapping_error,
};
(Omitted)
arch/arm64/mm/dma-mapping.c
static int __swiotlb_mmap(struct device *dev,
                          struct vm_area_struct *vma,
                          void *cpu_addr, dma_addr_t dma_addr, size_t size,
                          unsigned long attrs)
{
    int ret;
    unsigned long pfn = dma_to_phys(dev, dma_addr) >> PAGE_SHIFT;

    vma->vm_page_prot = __get_dma_pgprot(attrs, vma->vm_page_prot,
                                         is_device_dma_coherent(dev));

    if (dma_mmap_from_dev_coherent(dev, vma, cpu_addr, size, &ret))
        return ret;

    return __swiotlb_mmap_pfn(vma, pfn, size);
}
Note that __swiotlb_mmap() also calls dma_mmap_from_dev_coherent(), but before doing so it overwrites vma->vm_page_prot using __get_dma_pgprot().
And __get_dma_pgprot() looks like this:
arch/arm64/mm/dma-mapping.c
static pgprot_t __get_dma_pgprot(unsigned long attrs, pgprot_t prot,
                                 bool coherent)
{
    if (!coherent || (attrs & DMA_ATTR_WRITE_COMBINE))
        return pgprot_writecombine(prot);
    return prot;
}
Here pgprot_writecombine() turns the data cache off for the mapping.
In other words, when mapping the DMA buffer to user space, call dma_mmap_from_dev_coherent() directly instead of dma_mmap_coherent(), which forcibly disables the data cache.
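As an illustration, the mmap file operation of such a driver could look roughly like the sketch below. The struct plbram_data and its fields are hypothetical names standing in for whatever the driver uses to hold the device pointer and the buffer returned by dma_alloc_coherent(); this is a simplified outline, not the actual udmabuf implementation.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>

struct plbram_data {                 /* hypothetical driver-private data */
    struct device *dev;
    void          *virt_addr;        /* returned by dma_alloc_coherent() */
    size_t         alloc_size;
};

static int plbram_device_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct plbram_data *this = file->private_data;
    int ret;

    /*
     * Call dma_mmap_from_dev_coherent() directly instead of
     * dma_mmap_coherent(), so that vma->vm_page_prot is left untouched and
     * the user-space mapping keeps the data cache enabled.
     */
    if (dma_mmap_from_dev_coherent(this->dev, vma, this->virt_addr,
                                   this->alloc_size, &ret))
        return ret;

    /* The buffer did not come from this device's coherent pool. */
    return -ENXIO;
}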
If the memory on the PL side is only ever accessed by the CPU, enabling the data cache is all that is needed. However, enabling the data cache is not enough when a device other than the CPU also accesses the PL-side memory, or when the PL-side memory is enabled or disabled after Linux has booted. Since the data cache and the memory on the PL side can become inconsistent, their contents must be made consistent in some way.
dma-mapping provides an API to force the contents of the data cache and memory to match: dma_sync_single_for_cpu() and dma_sync_single_for_device().
include/linux/dma-mapping.h
static inline void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                                           size_t size,
                                           enum dma_data_direction dir)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);

    BUG_ON(!valid_dma_direction(dir));
    if (ops->sync_single_for_cpu)
        ops->sync_single_for_cpu(dev, addr, size, dir);
    debug_dma_sync_single_for_cpu(dev, addr, size, dir);
}

static inline void dma_sync_single_for_device(struct device *dev,
                                              dma_addr_t addr, size_t size,
                                              enum dma_data_direction dir)
{
    const struct dma_map_ops *ops = get_dma_ops(dev);

    BUG_ON(!valid_dma_direction(dir));
    if (ops->sync_single_for_device)
        ops->sync_single_for_device(dev, addr, size, dir);
    debug_dma_sync_single_for_device(dev, addr, size, dir);
}
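For reference, with an ordinary DMA buffer these functions are used around device accesses roughly as in the sketch below; this is a generic illustration of the dma-mapping API, not code from this article's design, and the function names are placeholders.

/* Case 1: the CPU has filled the buffer and a device will now read it.
 * Clean (write back) the CPU cache so the device sees the latest data. */
static void sync_before_device_reads(struct device *dev,
                                     dma_addr_t dma_handle, size_t size)
{
    dma_sync_single_for_device(dev, dma_handle, size, DMA_TO_DEVICE);
}

/* Case 2: a device has written the buffer and the CPU will now read it.
 * Invalidate the CPU cache so the CPU fetches the new data from memory. */
static void sync_before_cpu_reads(struct device *dev,
                                  dma_addr_t dma_handle, size_t size)
{
    dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
}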
Unfortunately, when these functions are run against a DMA buffer allocated from the device coherent pool, the Linux Kernel panics with a message like the following, complaining about a bad virtual address:
dmesg
[ 141.582982] Unable to handle kernel paging request at virtual address ffffffc400000000
[ 141.590907] Mem abort info:
[ 141.593725] ESR = 0x96000145
[ 141.596767] Exception class = DABT (current EL), IL = 32 bits
[ 141.602686] SET = 0, FnV = 0
[ 141.605741] EA = 0, S1PTW = 0
[ 141.608872] Data abort info:
[ 141.611748] ISV = 0, ISS = 0x00000145
[ 141.615584] CM = 1, WnR = 1
[ 141.618552] swapper pgtable: 4k pages, 39-bit VAs, pgdp = 000000005fbae591
[ 141.627503] [ffffffc400000000] pgd=0000000000000000, pud=0000000000000000
[ 141.634294] Internal error: Oops: 96000145 [#1] SMP
[ 141.642892] Modules linked in: fclkcfg(O) u_dma_buf(O) mali(O) uio_pdrv_genirq
[ 141.650118] CPU: 0 PID: 3888 Comm: plbram_test Tainted: G O 4.19.0-xlnx-v2019.2-zynqmp-fpga #2
[ 141.660017] Hardware name: Avnet Ultra96-V2 Rev1 (DT)
[ 141.665053] pstate: 40000005 (nZcv daif -PAN -UAO)
[ 141.669839] pc : __dma_inv_area+0x40/0x58
[ 141.673838] lr : __swiotlb_sync_single_for_cpu+0x4c/0x70
[ 141.679138] sp : ffffff8010bdbc50
[ 141.682437] x29: ffffff8010bdbc50 x28: ffffffc06d1e2c40
[ 141.691811] x27: 0000000000000000 x26: 0000000000000000
[ 141.697114] x25: 0000000056000000 x24: 0000000000000015
[ 141.702418] x23: 0000000000000013 x22: ffffffc06abb5c80
[ 141.707721] x21: 0000000000040000 x20: 0000000400000000
[ 141.713025] x19: ffffffc06a932c10 x18: 0000000000000000
[ 141.718328] x17: 0000000000000000 x16: 0000000000000000
[ 141.723632] x15: 0000000000000000 x14: 0000000000000000
[ 141.728935] x13: 0000000000000000 x12: 0000000000000000
[ 141.734239] x11: ffffff8010bdbcd0 x10: ffffffc06dba2602
[ 141.739542] x9 : ffffff8008f48648 x8 : 0000000000000010
[ 141.744846] x7 : 00000000ffffffc9 x6 : 0000000000000010
[ 141.750149] x5 : 0000000400000000 x4 : 0000000400000000
[ 141.755452] x3 : 000000000000003f x2 : 0000000000000040
[ 141.760756] x1 : ffffffc400040000 x0 : ffffffc400000000
[ 141.766062] Process plbram_test (pid: 3888, stack limit = 0x0000000037d4fe7f)
[ 141.773187] Call trace:
[ 141.775620] __dma_inv_area+0x40/0x58
[ 141.779280] udmabuf_set_sync_for_cpu+0x10c/0x148 [u_dma_buf]
[ 141.785013] dev_attr_store+0x18/0x28
[ 141.788668] sysfs_kf_write+0x3c/0x50
[ 141.792319] kernfs_fop_write+0x118/0x1e0
[ 141.796313] __vfs_write+0x30/0x168
[ 141.799791] vfs_write+0xa4/0x1a8
[ 141.803090] ksys_write+0x60/0xd8
[ 141.806389] __arm64_sys_write+0x18/0x20
[ 141.810297] el0_svc_common+0x60/0xe8
[ 141.813949] el0_svc_handler+0x68/0x80
[ 141.817683] el0_svc+0x8/0xc
[ 141.820558] Code: 8a230000 54000060 d50b7e20 14000002 (d5087620)
[ 141.826642] ---[ end trace 3084524689d96f4d ]---
For historical reasons, the dma-mapping API takes dma_addr_t, the physical address as seen from the DMA device, as its address argument.
On the other hand, arm64 has instructions for operating on the data cache, but the addresses these instructions take are virtual addresses.
dma_sync_single_for_cpu() and dma_sync_single_for_device() each call an architecture-dependent lower-level function. For arm64, __swiotlb_sync_single_for_cpu() and __swiotlb_sync_single_for_device() are ultimately called.
arch/arm64/mm/dma-mapping.c
static void __swiotlb_sync_single_for_cpu(struct device *dev,
                                          dma_addr_t dev_addr, size_t size,
                                          enum dma_data_direction dir)
{
    if (!is_device_dma_coherent(dev))
        __dma_unmap_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
    swiotlb_sync_single_for_cpu(dev, dev_addr, size, dir);
}

static void __swiotlb_sync_single_for_device(struct device *dev,
                                             dma_addr_t dev_addr, size_t size,
                                             enum dma_data_direction dir)
{
    swiotlb_sync_single_for_device(dev, dev_addr, size, dir);
    if (!is_device_dma_coherent(dev))
        __dma_map_area(phys_to_virt(dma_to_phys(dev, dev_addr)), size, dir);
}
The __dma_unmap_area() and __dma_map_area() called here are cache-control routines written in assembly language in arch/arm64/mm/cache.S, and they execute the arm64 data cache control instructions.
As explained earlier, the arm64 data cache instructions take virtual addresses, so __swiotlb_sync_single_for_cpu() and __swiotlb_sync_single_for_device() call phys_to_virt() to translate the physical address into a virtual address.
phys_to_virt() is defined in arch/arm64/include/asm/memory.h.
arch/arm64/include/asm/memory.h
#ifdef CONFIG_DEBUG_VIRTUAL
extern phys_addr_t __virt_to_phys(unsigned long x);
extern phys_addr_t __phys_addr_symbol(unsigned long x);
#else
#define __virt_to_phys(x)       __virt_to_phys_nodebug(x)
#define __phys_addr_symbol(x)   __pa_symbol_nodebug(x)
#endif

#define __phys_to_virt(x)       ((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
#define __phys_to_kimg(x)       ((unsigned long)((x) + kimage_voffset))

/*
 * Convert a page to/from a physical address
 */
#define page_to_phys(page)      (__pfn_to_phys(page_to_pfn(page)))
#define phys_to_page(phys)      (pfn_to_page(__phys_to_pfn(phys)))

/*
 * Note: Drivers should NOT use these. They are the wrong
 * translation for translating DMA addresses. Use the driver
 * DMA support - see dma-mapping.h.
 */
#define virt_to_phys virt_to_phys
static inline phys_addr_t virt_to_phys(const volatile void *x)
{
    return __virt_to_phys((unsigned long)(x));
}

#define phys_to_virt phys_to_virt
static inline void *phys_to_virt(phys_addr_t x)
{
    return (void *)(__phys_to_virt(x));
}
As you can see, a physical address is converted to a virtual address simply by subtracting PHYS_OFFSET and ORing in PAGE_OFFSET.
This conversion works fine for the memory that the Linux Kernel found and mapped at initialization. For other memory regions, however (for example, when PL-side memory is used as a DMA buffer as in this article), the conversion does not produce a valid mapping. As a result, the data cache operation instruction is given a bogus virtual address, and the CPU raises an exception.
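As a concrete check against the panic log above, the arithmetic works out as follows. This assumes PHYS_OFFSET is 0 (DRAM starts at physical address 0 on this board) and PAGE_OFFSET is 0xffffffc000000000 (39-bit virtual addresses, as shown in the log); the PL-side physical address is inferred from the x0/x20 register values in the dump, not stated explicitly anywhere.

/*
 * __phys_to_virt(x) = ((x) - PHYS_OFFSET) | PAGE_OFFSET
 *
 *   PL-side physical address : 0x0000000400000000
 *   - PHYS_OFFSET (0)        : 0x0000000400000000
 *   | PAGE_OFFSET            : 0xffffffc400000000   <-- the faulting address
 *
 * The linear map only covers the RAM that the kernel manages, so this
 * virtual address has no page table entry, and the "dc ivac" issued on it
 * faults.
 */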
So the dma-mapping API cannot control the data cache when memory on the PL side is allocated as the DMA buffer. I looked for various alternatives but could not find a good one. As a last resort, udmabuf v2.2.0-rc2 implements data cache control directly using the arm64 data cache instructions.
u-dma-buf.c
#if ((USE_IORESOURCE_MEM == 1) && defined(CONFIG_ARM64))
/**
 * DOC: Data Cache Clean/Invalidate for arm64 architecture.
 *
 * This section defines mem_sync_single_for_cpu() and mem_sync_single_for_device().
 *
 * * arm64_read_dcache_line_size() - read data cache line size of arm64.
 * * arm64_inval_dcache_area()     - invalidate data cache.
 * * arm64_clean_dcache_area()     - clean (flush and invalidate) data cache.
 * * mem_sync_single_for_cpu()     - sync_single_for_cpu() for mem_resource.
 * * mem_sync_single_for_device()  - sync_single_for_device() for mem_resource.
 */
static inline u64 arm64_read_dcache_line_size(void)
{
    u64       ctr;
    u64       dcache_line_size;
    const u64 bytes_per_word = 4;
    asm volatile ("mrs %0, ctr_el0" : "=r"(ctr) : : );
    asm volatile ("nop" : : : );
    dcache_line_size = (ctr >> 16) & 0xF;
    return (bytes_per_word << dcache_line_size);
}
static inline void arm64_inval_dcache_area(void* start, size_t size)
{
    u64 vaddr           = (u64)start;
    u64 __end           = (u64)start + size;
    u64 cache_line_size = arm64_read_dcache_line_size();
    u64 cache_line_mask = cache_line_size - 1;
    if ((__end & cache_line_mask) != 0) {
        __end &= ~cache_line_mask;
        asm volatile ("dc civac, %0" : : "r"(__end) : );
    }
    if ((vaddr & cache_line_mask) != 0) {
        vaddr &= ~cache_line_mask;
        asm volatile ("dc civac, %0" : : "r"(vaddr) : );
    }
    while (vaddr < __end) {
        asm volatile ("dc ivac, %0" : : "r"(vaddr) : );
        vaddr += cache_line_size;
    }
    asm volatile ("dsb sy" : : : );
}
static inline void arm64_clean_dcache_area(void* start, size_t size)
{
    u64 vaddr           = (u64)start;
    u64 __end           = (u64)start + size;
    u64 cache_line_size = arm64_read_dcache_line_size();
    u64 cache_line_mask = cache_line_size - 1;
    vaddr &= ~cache_line_mask;
    while (vaddr < __end) {
        asm volatile ("dc cvac, %0" : : "r"(vaddr) : );
        vaddr += cache_line_size;
    }
    asm volatile ("dsb sy" : : : );
}
static void mem_sync_single_for_cpu(struct device* dev, void* start, size_t size, enum dma_data_direction direction)
{
    if (is_device_dma_coherent(dev))
        return;
    if (direction != DMA_TO_DEVICE)
        arm64_inval_dcache_area(start, size);
}
static void mem_sync_single_for_device(struct device* dev, void* start, size_t size, enum dma_data_direction direction)
{
    if (is_device_dma_coherent(dev))
        return;
    if (direction == DMA_FROM_DEVICE)
        arm64_inval_dcache_area(start, size);
    else
        arm64_clean_dcache_area(start, size);
}
#endif
When the memory on the PL side is allocated as the DMA buffer, udmabuf's sync_for_cpu and sync_for_device call mem_sync_single_for_cpu() and mem_sync_single_for_device(), respectively.
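From user space this means the buffer can be mapped with the data cache enabled and the cache can be invalidated or cleaned on demand through udmabuf's sysfs attributes. The sketch below assumes a device node /dev/udmabuf0 with attributes under /sys/class/u-dma-buf/udmabuf0/; the exact paths and attribute names depend on the udmabuf version, so check them against the udmabuf documentation.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define BUF_SIZE (256 * 1024)

int main(void)
{
    /* Device and sysfs paths are assumptions; adjust them to your setup. */
    int fd = open("/dev/udmabuf0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map without O_SYNC so the data cache is typically left enabled. */
    unsigned char *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... the CPU reads and writes buf here ... */

    /* Invalidate the CPU cache before reading data written by the PL side. */
    FILE *fp = fopen("/sys/class/u-dma-buf/udmabuf0/sync_for_cpu", "w");
    if (fp) {
        fputs("1\n", fp);
        fclose(fp);
    }

    munmap(buf, BUF_SIZE);
    close(fd);
    return 0;
}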
I had assumed it was fairly common to implement memory (BRAM in this example) or a DRAM controller on the PL side and use that memory from Linux, but when I actually implemented it while seriously taking the data cache into account, it turned out to be unexpectedly difficult.
The Kernel Panic in particular made me cry. I did not expect such a trap in the physical-to-virtual address translation. The dma-mapping API has a long history and perhaps no longer fits current architectures well.
Personally, I wish an API for data cache operations that takes virtual addresses had been exported. There should be other uses for it as well. For example, at the end of arch/arm64/mm/flush.c there is the following code:
arch/arm64/mm/flush.c
#ifdef CONFIG_ARCH_HAS_PMEM_API
void arch_wb_cache_pmem(void *addr, size_t size)
{
    /* Ensure order against any prior non-cacheable writes */
    dmb(osh);
    __clean_dcache_area_pop(addr, size);
}
EXPORT_SYMBOL_GPL(arch_wb_cache_pmem);

void arch_invalidate_pmem(void *addr, size_t size)
{
    __inval_dcache_area(addr, size);
}
EXPORT_SYMBOL_GPL(arch_invalidate_pmem);
#endif
If CONFIG_ARCH_HAS_PMEM_API is defined, the data cache operation functions I wanted are exported (EXPORT_SYMBOL_GPL). This API appears to be provided for non-volatile memory (Persistent MEMory).
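For illustration only: if these symbols are available (they are GPL-only exports, present only when CONFIG_ARCH_HAS_PMEM_API is set, and declared for the pmem subsystem in include/linux/libnvdimm.h as far as I can tell), a GPL module could in principle clean or invalidate a range by virtual address like this. Treat it purely as a sketch of the kind of API I would have liked, not as a recommended usage.

#include <linux/libnvdimm.h>   /* declares arch_wb_cache_pmem() / arch_invalidate_pmem() */

static void example_cache_ops_by_vaddr(void *vaddr, size_t size)
{
    /* Write dirty cache lines for [vaddr, vaddr + size) back to memory. */
    arch_wb_cache_pmem(vaddr, size);

    /* Invalidate the cache lines for [vaddr, vaddr + size). */
    arch_invalidate_pmem(vaddr, size);
}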
["Device driver for programs running in user space on Linux and hardware sharing memory" @Qiita]: https://qiita.com/ikwzm/items/cc1bb33ff43a491440ea "" Programs running in user space on Linux Device driver for hardware to share memory with @Qiita " ["Device driver for memory sharing between programs and hardware running in user space on Linux (reserved-memory)" @Qiita]: https://qiita.com/ikwzm/items/9b5fac2c1332147e76a8 "" On Linux Device driver for programs and hardware running in user space to share memory (reserved-memory edition) ”@Qiita" [udmabuf]: https://github.com/ikwzm/udmabuf "udmabuf" [udmabuf v2.2.0-rc2]: https://github.com/ikwzm/udmabuf/tree/v2.2.0-rc2 "udmabuf v2.2.0-rc2" [ZynqMP-FPGA-Linux v2019.2.1]: https://github.com/ikwzm/ZynqMP-FPGA-Linux/tree/v2019.2.1 "ZynqMP-FPGA-Linux v2019.2.1"