BML USB 3.0 FPGA interface over PMOD


An open-source-hardware USB 3.0 to FPGA PMOD interface design from Black Mesa Labs:

Black Mesa Labs is presenting an open-source-hardware USB 3.0 to FPGA PMOD interface design.  First off, please lower your expectations: the USB 3.0 physical layer is capable of 5 Gbps, or 640 MBytes/sec, and this project can’t provide that to your FPGA over two PMOD connectors – not even close. It does substantially improve PC-to-FPGA bandwidth, however: 30x for writes and 100x for reads compared to a standard FTDI cable based on the FT232 (an RS232-style UART interface at 921,600 baud). A standard FTDI cable is $20 and the FT600 chip is less than $10, so BML deemed it a project worth pursuing.
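As a rough sanity check of those multipliers, here is a back-of-envelope sketch; the 8N1 framing overhead assumed for the UART baseline is my addition, not from the post:

```python
# Back-of-envelope bandwidth comparison using the figures from the post.
# A UART at 921,600 baud with 8N1 framing spends 10 bit-times per byte.
uart_bytes_per_sec = 921_600 / 10          # ~92 KB/s baseline

ft600_write = uart_bytes_per_sec * 30      # claimed 30x write improvement
ft600_read  = uart_bytes_per_sec * 100     # claimed 100x read improvement

print(f"UART baseline: {uart_bytes_per_sec / 1e6:.2f} MB/s")
print(f"FT600 writes : {ft600_write / 1e6:.2f} MB/s")
print(f"FT600 reads  : {ft600_read / 1e6:.2f} MB/s")
```

So the claimed improvement lands around 2.8 MB/s for writes and 9.2 MB/s for reads – far below USB 3.0’s 640 MB/s ceiling, but a huge step up from a UART cable.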

More details at the Black Mesa Labs homepage.

Via the contact form.

Join the Conversation


  1. This is the kind of circuit I’d like to see on an FPGA dev board – rather than as an add-on.
    Using 3 PMODs may not be practical, but using the equivalent number of tracks would be a relatively small upgrade from 2 PMODs’ worth.

    1. I’ve got a couple of projects in the pipe regarding this. One is a small QFN48 FPGA that will connect directly to the FT600 and selectably reduce the pin count down to either a single PMOD ( about 50 MBytes/sec ) or just 3 pins ( about 50 Mbits/sec ). The other is an FPGA board that uses a 67-pin M.2 PCIe connector ( a laptop SSD connector ) for expansion. A FT600 would fit easily on an add-on board like this. Thanks for reading!

  2. Later in their post, they say the design is theoretically capable of 46 MB/sec – not too shabby. I had a good laugh: they used Python to push data and were then surprised when performance crawled. Don’t folks these days know the ballpark performance of interpreters? Hmmmm. Anyway, USB 2.0 (with widely available parts) can routinely do 25 MB/sec for storage transfers, so it is not evident to me that they need USB 3.0 for their use case. Small transfer block sizes are very, very bad for USB speed; deal with transfer sizes first, then we can talk about good performance.
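The block-size point can be illustrated with a toy model: assume a fixed per-transfer overhead (the 150 us figure here is a hypothetical stand-in for driver plus interpreter latency, not a measured value) on top of the post’s theoretical 46 MB/s wire rate:

```python
def throughput_mb_s(block_bytes, overhead_us=150.0, wire_mb_s=46.0):
    """Effective rate for one blocking transfer: fixed per-call overhead
    plus time on the wire (46 MB/s == 46 bytes per microsecond)."""
    wire_time_us = block_bytes / wire_mb_s
    return block_bytes / (overhead_us + wire_time_us)

for size in (64, 512, 4096, 65536, 1 << 20):
    print(f"{size:>8} B blocks -> {throughput_mb_s(size):6.2f} MB/s")
```

With 512-byte transfers the model predicts only about 3 MB/s; the fixed overhead only amortizes away once blocks reach tens of kilobytes.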

    1. In Python’s defense, I regularly get about 10 MBytes/sec from Python on a PCIe interface to an FPGA with a PCIe core. Interpretation is pretty damn quick when executing a small, fully cacheable script on a 3 GHz CPU, especially with block data transfers. That said, Python is still the bottleneck here, just 10x less of one than I thought it’d be. Thanx for reading.

      1. Thanks for replying. I read the thing again and realized I missed the part where you had already explained the bottleneck with the 154us/160us timing – my bad. And you are using 512-byte bulk packets on the USB side, so not that much overhead. So Python it is. How much CPU is your Python script using? (Process Explorer on the status bar for monitoring CPU usage is a must for me on Windows.)

        I wonder if your Python to PCIe interface has a lot less data-massaging done. In your code, you are constructing the payload as a string of ASCII nibbles for the data bytes. Say, 32 DWORDs gives 256 nibbles, so you are constructing a 256-byte payload. But Python strings are immutable, so you are making 32 strings in between, and that involves moving over 4KB of string bytes (8-byte string, 16-byte string, 24-byte string, … 256-byte string). That’s a 33X data processing explosion there, and while running, Python is also furiously garbage collecting all the unused strings too.
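The blow-up described above is why the usual Python idiom collects the pieces and joins them once. A minimal sketch (function and variable names are mine, not from the actual source):

```python
def build_concat(words):
    # Repeated += on an immutable str may recopy the growing payload
    payload = ""
    for w in words:
        payload += "%08x" % w
    return payload

def build_join(words):
    # Collect the pieces, then copy each character exactly once at the end
    return "".join("%08x" % w for w in words)

words = list(range(32))                    # 32 example DWORDs
assert build_concat(words) == build_join(words)
print(len(build_join(words)))              # 256 hex characters
```

(Modern CPython can sometimes resize a uniquely-referenced string in place, so += is not always quadratic in practice, but join is the portable habit.)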

      2. I suspect the string format line in your code is also a performance killer:

        payload += ( "%08x" % each_data );

        That’s a mini string interpreter in your inner loop. The "%08x" format string may be scanned and its directives executed on every iteration. I don’t know enough of Python to know how well it optimizes such code.
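For comparison, a binary path skips the per-word format machinery entirely. A hedged sketch using the stdlib struct module (the data values are examples):

```python
import struct

data = list(range(32))                              # 32 example DWORDs

# ASCII route: the "%08x" format is applied once per word
ascii_payload = "".join("%08x" % d for d in data)   # 256 chars

# Binary route: one C-level call packs all 32 big-endian DWORDs at once
bin_payload = struct.pack(">32I", *data)            # 128 bytes

print(len(ascii_payload), len(bin_payload))
```

The binary payload is also half the size on the wire, which matters just as much as the CPU time.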

  3. I haven’t monitored CPU usage, but will make a point to in the future. You are absolutely correct about Python and strings. My hardware is 100% capable of doing binary transfers; it was a deliberate decision when I spec’d Mesa Bus Protocol to transfer ASCII strings from the software domain to hardware, as in my experience it greatly reduces software development and debug time. On the same note, at both my day job and in my BML projects my time is dedicated 90% to hardware development and 10% (or less) to support software. For that reason I hardly ever write in C anymore, as I find myself 100x more efficient writing Python than C ( and the same was true for Perl before I switched to Python ). Python can be funny: I fixed a major rendering performance bottleneck in my GUI ( a waveform viewer for an embedded logic analyzer ) after discovering that "if ( var_bool == True )" is considerably slower than "if ( var_bool )". I assumed my VHDL-influenced verbose style would just get optimized away; I was sadly wrong.
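The `== True` penalty is visible in the bytecode: the verbose form compiles to an extra constant load and comparison on every pass. A quick check with the stdlib dis module (function names are mine):

```python
import dis

def verbose(var_bool):
    if var_bool == True:
        return 1

def terse(var_bool):
    if var_bool:
        return 1

n_verbose = len(list(dis.get_instructions(verbose)))
n_terse = len(list(dis.get_instructions(terse)))
print(n_verbose, n_terse)   # verbose compiles to more instructions
```

In a tight inner loop those extra instructions per iteration add up, which matches the rendering slowdown described above.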

    1. IMO, always try to avoid strings in production high-speed serial I/O code (they’re OK for ground-level dev and debug, though). In C, for example, zero-terminated variable-length strings often lead to heap fragmentation and possibly unbridled heap growth, either of which may open up buffer under/overflow vulnerabilities. Work with a pre-allocated array of fixed size instead. With interpreters, compiling the script to an executable can sometimes help (by magic – YMMV, don’t trust it).
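The same pre-allocation advice carries over to Python: a bytearray sized once up front can be refilled in place for every transfer. A sketch under that idea (buffer size and helper name are my inventions):

```python
import struct

BUF_SIZE = 4096
buf = bytearray(BUF_SIZE)      # allocated once, reused for every transfer

def fill(buf, words):
    # Pack each DWORD into the fixed buffer in place: no per-call
    # allocation and nothing for the garbage collector to chase.
    words = list(words)
    for i, w in enumerate(words):
        struct.pack_into(">I", buf, i * 4, w)
    return memoryview(buf)[: len(words) * 4]

view = fill(buf, range(32))
print(len(view))               # 128 bytes, no intermediate strings
```

The memoryview hands the filled region to an I/O call without copying it again, which is the Python analog of writing into a fixed C array.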

  4. Those complaining about Python speed usually don’t know squat about software performance. Often the bottleneck is not the language but the data structures and algorithms used, and Python not only supports a plethora of those (and all the included ones, plus the most popular external modules, are ridiculously fast compiled C/C++ code) but also makes it easy to make a good choice and to evaluate the performance of alternatives. For numerical applications, I’ve routinely beaten the crap out of optimized C++ that took orders of magnitude longer to develop than the first iteration of my Python PoC. If you manage to write decent Python code, the first *real* bottleneck you’ll hit is the GIL; it’s really tricky to exploit multiple cores with pure Python code. But once you know where the most baggage is, it’s typical to look for an extension module that does what you need (and has the C code to do it natively, around the GIL) or to write one yourself.

    I’ve done realtime image capturing, processing and classification coming from two cameras, each almost fully saturating a GigE link (uncompressed RAW images), in Python (with some C extensions) on a Dual-Xeon over a decade ago… Go figure.

    1. It depends on how Python is used; different coders will have different experiences. You obviously know how to get the most out of Python. I note that you say "(with some C extensions)" for your realtime use case. Compiled extensions are fast, yes, we know, and in your use cases you obviously went in the direction needed to get enough performance. Tons of big games and large apps are scripted too, all with C functions where speed is necessary. But it takes time to get to the top of your game.

      The problem arises when someone codes a quickie prototype (often totally fine for development testing of hardware, but it hits limits in this case). Such a thing is often completely scripted (when speed is initially not an issue, just correct functionality), and the coder has limited experience in avoiding inefficient code. Assume the coder has been happy with performance thus far (note that the quoted Python-to-PCIe performance of 10 MB/sec may also have been bottlenecked by scripting, but it was “fast enough” for a quickie prototype). So it’s a particular app script that is slow, not Python *cough*. I apologize for not writing perfectly precise words 100% of the time; I should be replaced by an AI. It’s an issue with quickie prototyping where the coder has likely been using all-scripting, not your case, where you used compiled extensions where appropriate.

      1. Part of my point was that it’s all in the data structures and algorithms, not optimisation. Python can be blazingly fast “as-is” and compiled languages can be dog slow (and often are, because the “best” data structures for a use case are far too cumbersome to use in a quick prototype). With a bit of thought, pure Python code can be really quick (even if limited to a single core), and my gut feeling says that decoding 100 MB/s+ of packetized ASCII in pure Python on a single core of any modern Intel/AMD CPU should not be any issue at all.
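That gut feeling is easy to spot-check, since CPython’s bytes.fromhex does the ASCII decode in C. A sketch with a hypothetical 1 MiB hex packet (the repeated word is an arbitrary example value):

```python
import time

# Hypothetical 1 MiB packet of ASCII hex, Mesa-Bus style
ascii_packet = ("%08x" % 0x12345678) * (1 << 17)

t0 = time.perf_counter()
decoded = bytes.fromhex(ascii_packet)     # single C-level decode pass
dt = time.perf_counter() - t0

print(f"decoded {len(decoded)} bytes in {dt * 1e3:.2f} ms")
```

On typical desktop hardware this should comfortably exceed 100 MB/s; the trap is doing the decode character-by-character in an interpreted loop instead of in one C-backed call.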
