Windows Vista Tips

PCIe Driver read problem

 
 
shingo
Guest
Posts: n/a

 
      11-21-2009



Under 32-bit Windows XP SP2, I am developing a driver for a specially designed PCIe
device that uses an FPGA as the PCIe module.
With WDK 7200, I chose "KMDF" as the driver model.
First, "MmMapIoSpace" maps the device DDR into my virtual address space; then I use
"RtlCopyMemory" to read from the mapped virtual address. I also tried
"READ_REGISTER_BUFFER_XXX" for the physical address.
These two functions behave the same. Is there any problem here?

When I read 2 DWORDs, the FPGA gets two separate requests, each asking
for 1 DWORD. This makes the read operation too slow, only 4 MBps (I found out
that a PCIe x4 device can reach 5 GBps).
I can only suppose that the PCI bus driver (pci.sys) splits my request. Why can't
the PCI bus driver send a request for 2 DWORDs at a time? Or is there anything
wrong in my driver or FPGA code?
Thanks for your attention. I look forward to your reply.

--
shingo for windows driving & winCE driving
 
 
Charles Gardiner
Guest
Posts: n/a

 
      11-21-2009
Hi Shingo,

if you want to go faster than about 4MB/s with a PCIe device you will
have to put a DMA controller in your FPGA. The DMA controller must move
the data and not the processor.
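As a rough sanity check on the 4 MB/s figure: if every DWORD read has to finish before the next one can start, throughput is just bytes per read divided by the round-trip time. The sketch below assumes a round trip of about 1 us per DWORD; the function name is made up for illustration.

```c
#include <assert.h>

/* Sustained throughput in MB/s (10^6 bytes per second) when each read
 * moves `bytes_per_read` bytes and the next read cannot start until the
 * previous one has completed after `latency_ns` nanoseconds. */
static double serialized_read_mbps(unsigned bytes_per_read, double latency_ns)
{
    assert(latency_ns > 0.0);
    /* 1 byte per ns equals 10^9 bytes/s, i.e. 1000 MB/s. */
    return (double)bytes_per_read / latency_ns * 1000.0;
}
```

With 4 bytes per read and 1000 ns per read this gives 4 MB/s; at 1500 ns it drops to roughly 2.7 MB/s. Only larger requests or overlapping requests (which is what a DMA engine provides) change that picture.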

Take a look at the PciDrv or PLX9... examples in the WDK.

Processors typically use a load-and-store model, which means:

    loop
        read Location n;
        write Location m;
    end loop;

Even READ_REGISTER_BUFFER_XXX is just a wrapper around such a sequence.
You will find it very very hard to get any processor to burst across a
PCIe connection. I'm sure custom solutions are possible if you have a
processor supporting block-move/block-copy instructions, but this will
not be portable.
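In C, the load-and-store loop above looks roughly like the following user-mode sketch. The `device` pointer is a hypothetical stand-in (here it can be an ordinary array); in a driver it would be a mapped, uncached device address, and each load from it would be expected to go out as its own single-DWORD read. The function name and the returned read counter are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies `count` DWORDs from a device window into ordinary memory using
 * one load and one store per iteration. `volatile` stops the compiler
 * from merging, reordering, or dropping the device loads.
 * Returns the number of device reads issued. */
static size_t copy_dwords_from_device(const volatile uint32_t *device,
                                      uint32_t *buffer, size_t count)
{
    size_t reads = 0;

    assert(device != NULL && buffer != NULL);
    for (size_t i = 0; i < count; i++) {
        buffer[i] = device[i];   /* one DWORD load, then one DWORD store */
        reads++;
    }
    return reads;
}
```

Reading 2 DWORDs this way issues 2 separate loads, which matches the two single-DWORD requests seen at the FPGA.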
 
 
Charles Gardiner
Guest
Posts: n/a

 
      11-21-2009
Alexander Grigoriev schrieb:

>
> 4 MB/s means 1 us per read. This seems too much. I suspect it's your FPGA
> that cannot handle it fast enough.
>


Hi Alexander,

1us per read would in fact be quite fast. Typically a single read will
cost you in the region of 1.4 us to 1.8 us, at least that has been the
case on all PCs I have measured (with Logic Analyzer). There are two issues:

1) A read is a split transaction, i.e. the PC sends a request packet to the
FPGA, and the FPGA responds with a completion packet.

2) The operating system seems to have quite a lot of overhead.

I suspect current chip-sets /OSes handle a processor read from a
peripheral by just setting up the registers in the I/O controller host
(ICH) and then suspending the thread while waiting for an interrupt. For
an interrupt alone, I have often measured about 300ns response time.
That's just the time needed to call the ISR. The DPC hasn't even run
yet. I don't think there is a whole lot of difference here between Port
(I/O space) and Memory space reads, from a timing point of view.

I have done a few simulations with FPGAs and the hardware alone would
typically require below 800 ns for the round trip (request packet,
process request, completion packet). Sometimes with optimal buffer
settings, credit settings etc. maybe even as good as 450 ns.

Single requests are a terrible waste of bandwidth (if you are in a
hurry). For a single DWORD data (read), you transfer a total of 7 or 8
DWORDS.
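One plausible way to arrive at the 7 or 8 DWORDs: a memory read request carries a 3-DWORD header (32-bit address) or a 4-DWORD header (64-bit address), and the completion carries a 3-DWORD header plus the data. The sketch below counts only transaction-layer headers and payload; link-layer framing, sequence numbers, and LCRC are deliberately left out, and a single completion per request is assumed. All names are illustrative.

```c
#include <assert.h>

enum {
    MRD_HEADER_DW_32BIT = 3,  /* memory read request, 32-bit address */
    MRD_HEADER_DW_64BIT = 4,  /* memory read request, 64-bit address */
    CPL_HEADER_DW       = 3   /* completion-with-data header */
};

/* Total DWORDs on the link for one read: request header,
 * completion header, and `payload_dw` DWORDs of returned data. */
static unsigned read_total_dwords(unsigned payload_dw, int addr64)
{
    assert(payload_dw > 0);
    return (addr64 ? MRD_HEADER_DW_64BIT : MRD_HEADER_DW_32BIT)
           + CPL_HEADER_DW + payload_dw;
}

/* Share of the transferred DWORDs that is actual data, in percent. */
static double read_payload_percent(unsigned payload_dw, int addr64)
{
    return 100.0 * payload_dw / read_total_dwords(payload_dw, addr64);
}
```

For a single DWORD this gives 7 or 8 DWORDs in total, so only about 12 to 14 percent of the traffic is data. A 32-DWORD read under the same assumptions would be about 84 percent data, which is why bursts matter.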

Regards,
Charles
 
 
Don Burn
Guest
Posts: n/a

 
      11-21-2009
For a normal read as described here with READ_REGISTER_XXX, the
operation is a direct processor read. Your description is far from reality:
this is just a read of a memory location that exists on the PCIe bus (so it may
take a few cycles). There is in essence no OS intervention and there are no
interrupts.


--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr
 
Charles Gardiner
Guest
Posts: n/a

 
      11-21-2009
Don Burn schrieb:
> For a normal read case as is being described here with READ_REGISTER_XXX the
> operation is a direct processor read, your description is far from reality
> this is just a read of a memory location existing on the PCIe bus (so it may
> take a few cycles), there is in essence no OS intervention and no
> interrupts.


Hi Don,

regarding reality, I guess we'll have to disagree on what reality is. I
had a customer pestering me for accurate figures, so I set up the entire
measurement scenario myself. The figures/scenario are very real and very
reproducible. In fact, a few months ago I met a chap at a training course who
had very similar figures with his own scenario.

My Scenario:
1 x PCIe FPGA Demo Board
1 x KMDF Device driver
1 x Windows XP
1 x GUI application (with CodeGear RAD Studio)

The customer could type in the burst size he wanted in the GUI.
Otherwise two buttons, one read one write and edit boxes to
enter/display the write/read data. The memory available was a 32Kx 32
embedded RAM array inside the FPGA. Internal time to read a single
DWord, about 24 ns (i.e. 3 cycles @ 8 ns).

On the driver side (leaned strongly on the PciDrv, PCL90x0 examples in
the WDK), I converted the IRPs to READ_REGISTER_BUFFER_ULONG or
WRITE_REGISTER_BUFFER_ULONG with the 'count' field being the value that
the user entered in the GUI mask. In reality also verified with TraceView.

I measured the time between consecutive reads /writes by bringing
internal FPGA signals out to test pins. The observation was that the PC
never bursts and memory writes are up to four times as fast as
non-posted (mem read, I/O rd/wr) PCIe requests, since here there is no
completion. What is important, the time measured here is independent of
the GUI and user part of the operating system. I just measure the time
used by the READ_REGISTER_BUFFER_ULONG part of a single IRP e.g. how
long do I need to read 5 consecutive DWords.

What admittedly is only conjecture (i.e. not tangible reality, at least
not for me), is where all this time overhead arises. From my real
simulations, I know it is not in the FPGA (here max ~900ns round trip,
typical more like 600ns). Whether the overhead is in the operating
system as written by Microsoft or in the chip-set driver, probably as
written by Intel, I indeed can't say for sure. From a HW or user point
of view, I would however consider both together as 'the operating
system'. My assumption on the interrupt etc. is based on descriptions I
have seen regarding how Intel chip-sets generally implement port
input/output requests. I'm assuming they do much the same for memory
reads/writes. I am also sure that real memory (as in internal RAM)
reads/writes are handled much more effectively as these go through the
MCH chip in 'standard' Intel chip-sets. The newer Server single-chip
companion chip E3xxx (or some number like that) with natively attached
PCIe lanes for the peripherals are probably also faster but I don't have
such a system (yet).

By the way, any real information you have regarding where the overhead
arises would be much appreciated.

Regards,
Charles
 
 
Don Burn
Guest
Posts: n/a

 
      11-21-2009

"Charles Gardiner" <> wrote in message
news:he9e5n$707$01$...
> What admittedly is only conjecture (i.e. not tangible reality, at least
> not for me), is where all this time overhead arises. From my real
> simulations, I know it is not in the FPGA (here max ~900ns round trip,
> typical more like 600ns). Whether the overhead is in the operating
> system as written by Microsoft or in the chip-set driver, probably as
> written by Intel, I indeed can't say for sure. From a HW or user point
> of view, I would however consider both together as 'the operating
> system'. My assumption on the interrupt etc. is based on descriptions I
> have seen regarding how Intel chip-sets generally implement port
> input/output requests. I'm assuming they do much the same for memory
> reads/writes. I am also sure that real memory (as in internal RAM)
> reads/writes are handled much more effectively as these go through the
> MCH chip in 'standard' Intel chip-sets. The newer Server single-chip
> companion chip E3xxx (or some number like that) with natively attached
> PCIe lanes for the peripherals are probably also faster but I don't have
> such a system (yet).


If you are referring to the READ_REGISTER_ULONG operation, then your conjecture
is way off. If you look at the include file that defines this operation,
it is a memory barrier followed by a memory read, or just a volatile
memory read, so talking about the OS or the chipset driver getting in the
middle is nonsense. These reads are non-cached, they do go through the Intel
chipset, and you are issuing one operation at a time, but it is not some
mysterious driver that is getting in the way.


--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr



 
Alexander Grigoriev
Guest
Posts: n/a

 
      11-21-2009

"Charles Gardiner" <> wrote in message
news:he9e5n$707$01$...
>
> What admittedly is only conjecture (i.e. not tangible reality, at least
> not for me), is where all this time overhead arises. From my real
> simulations, I know it is not in the FPGA (here max ~900ns round trip,
> typical more like 600ns). Whether the overhead is in the operating


Do you mean 600-900 ns is your FPGA's round-trip time? Do you have a PCIe
analyser? It would come in very handy for debugging such issues.


 
 
Charles Gardiner
Guest
Posts: n/a

 
      11-21-2009

> If you are referring to READ_REGISTER_ULONG operation, then your conjecture
> is way off. If you look at the include file that defines this the operation
> it is a memory barrier operation and then a memory read, or just a volatile
> memory read so talking about the OS or the chip set driver getting in the
> middle is nonsense. These are non-cached and they do go through Intel
> chipset and you are issuing one operation at a time, but it is not some
> mysterious driver that is getting in the way.
>


To be precise, I'm referring to
READ_REGISTER_BUFFER_ULONG(Register, Buffer, Count).

which is defined in wdm.h as

__forceinline
VOID
READ_REGISTER_BUFFER_ULONG (
    PULONG Register,
    PULONG Buffer,
    ULONG Count
    )
{
    _ReadWriteBarrier();              /* compiler-only barrier */
    __movsd(Buffer, Register, Count); /* emits rep movsd for Count DWORDs */
    return;
}

/* Variant using __mf(), the IA-64 memory fence intrinsic: */
#define READ_REGISTER_BUFFER_ULONG(x, y, z) {                               \
    PULONG registerBuffer = x;                                              \
    PULONG readBuffer = y;                                                  \
    ULONG readCount;                                                        \
    __mf();                                                                 \
    for (readCount = z; readCount--; readBuffer++, registerBuffer++) {      \
        *readBuffer = *(volatile ULONG * const)(registerBuffer);            \
    }                                                                       \
}


Are you saying that in reality the processor just sits there for about
1.5 us per iteration waiting for PCIe to come back with a single data
DWORD (because that is what I measure at the hardware)? Surely not.

But 1.5 us is one heck of a long time for a processor which is supposed
to be running at, say, 2.x GHz. Assuming, then, that there is no
'mysterious driver' involvement, what is happening in reality?
- Thread suspension + polling of some status register in the ICH to say
when PCIe has finished?
- Indefinite thread suspension until ICH signals PCIe completion over an
FSB interrupt?
- Something else?
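For scale, here is a small sketch of how many core clock cycles go by during one such read. The 2.0 GHz value used in the note below is an assumed example for the "2.x GHz" mentioned above, and the function name is illustrative.

```c
#include <assert.h>

/* Core clock cycles that elapse while one read of `latency_ns`
 * nanoseconds is outstanding on a core running at `core_ghz` GHz.
 * 1 GHz is exactly one cycle per nanosecond. */
static double cycles_per_read(double latency_ns, double core_ghz)
{
    assert(latency_ns >= 0.0 && core_ghz > 0.0);
    return latency_ns * core_ghz;
}
```

At 1500 ns and 2.0 GHz this comes to about 3000 cycles per DWORD read.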
 
 
Charles Gardiner
Guest
Posts: n/a

 
      11-21-2009
Alexander Grigoriev schrieb:

>
> Do you mean 600-900 ns is your FPGA's roundtrip time? Do you have PCIe
> analyser? It would come very handy to debug such issues.
>
>

The figures are:
- roughly 450 ns for the first few packets, i.e. buffers empty plenty of
credits
- typical 600 ns, DMA traffic but no credit stalls
- worst 900 ns, heavy DMA and some credit stalls

By round-trip time, I mean the time from the first header DWORD entering the
PCIe core in the requester to the last DWORD of the completion packet
arriving at the requester. This
was measured in simulation with two identical PCIe cores connected
back-to-back (Aldec VHDL/Verilog simulator, Lattice ECP2M FPGA). i.e.
this is the time that user logic in the PCIe end-point would see if the
completer was a pure hardware implementation and could deliver data as
soon as the request had been received.

Packet transmission is normally pretty fast. It's the reception that's
slow since the packet has to be checked by the data-link layer before
passing it on to the user/application logic. Switches in the path often
use transparent mode i.e. the data layer checks on the fly and issues a
'nullify' if it unexpectedly detects a link CRC error.

The assumption here is of course that all chips have much the same
overhead in the data-link/physical layers. From the figures I have from
different chips or heard from people on different projects, this is the
case. In PCIe Gen 1.x, your byte time is 4 ns (UI 400 ps).
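The 4 ns byte time follows from the Gen 1.x signalling rate of 2.5 GT/s per lane and 8b/10b coding, where every byte occupies 10 unit intervals on the wire. A minimal sketch of that arithmetic, with illustrative function names:

```c
#include <assert.h>

/* Duration of one unit interval (one bit time on the wire) in
 * picoseconds for a lane running at `gigatransfers_per_s` GT/s. */
static double unit_interval_ps(double gigatransfers_per_s)
{
    assert(gigatransfers_per_s > 0.0);
    return 1000.0 / gigatransfers_per_s;
}

/* Time to send one byte on one lane, in nanoseconds, when each byte is
 * encoded as `bits_per_symbol` bits (10 for 8b/10b coding). */
static double byte_time_ns(double gigatransfers_per_s, unsigned bits_per_symbol)
{
    return unit_interval_ps(gigatransfers_per_s) * bits_per_symbol / 1000.0;
}
```

For 2.5 GT/s and 10 bits per symbol this gives a 400 ps UI and 4 ns per byte per lane. On a single lane, the 28 bytes of header and data for a one-DWORD read would then take roughly 112 ns of wire time, before any framing or link-layer overhead, so serialization alone does not account for a round trip of several hundred nanoseconds.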
 
 
Don Burn
Guest
Posts: n/a

 
      11-21-2009
The processor is blocked, waiting for the PCIe transaction to
complete. There is no magic polling or interrupt here. This is a
function of the path from the processor over PCIe to the device and back. You may want
to believe 'surely not', but there is no software in this, just the hardware.


--
Don Burn (MVP, Windows DKD)
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr