[En-Nut-Discussion] Questions regarding UART implementation

Ulrich Prinz uprinz2 at netscape.net
Wed Sep 29 23:56:39 CEST 2010


Hi!

Wow, I had a tricky problem with my system... Calling a newlib function
in a program causes the data segment to have an offset...
Took me a while to find that out.

Am 29.09.2010 21:29, schrieb Thiago A. Corrêa:
> Hi Ulrich,
>
> There are a lot of topics here :)
>
> On Tue, Sep 28, 2010 at 1:44 PM,<uprinz2 at netscape.net>  wrote:
>> Hi Thiago,
>>
>>
>> thanks for the info. I thought that this would happen :) Lots of
>> nice feature switches and none implemented. I agree with you that
>> sometimes it makes sense to determine the old lines and to abstract
>> them. It's a good idea and I try to support it.
>>
>>
>> For the blockread / blockwrite functions I found a misleading
>> description. So these should be totally different things:
>> USART_BLOCKWRITE should control block transfer for writing.
>> USART_BLOCKREAD should control block transfer for reading.
>> USART_BLOCKING should control if the calling thread is blocked
>> until the transfer is done. I think there is an option ASYNC too,
>> which I would call the one that controls if a thread is blocked or
>> not on calling a usart transfer (read or write). But that is not
>> working or not implemented.
>
> From what I understand, it helps to read from devices which send a
> fixed size "packet". AFAIK it's not implemented in any of our archs.
> Even on PCs it's fairly uncommon to use it. Even people who write bad
> code to read from barcode scanners don't usually use that feature.
> It's usually much better to find the STX or ETX yourself, or
> otherwise parse the protocol properly.
>
Maybe it is uncommon, but it can save a lot of memory and processing
time. My application uses lots of packets of different sizes and I don't
want to raise the ring buffers to over 400 bytes just in case.

Ok, here's what I do:
I have the basic frames in flash. If I need to send one, I borrow an
area from the heap and memcpy the frame into that area. Then I overlay a
virtual struct and write the data into the struct. In fact the data is
written directly into the borrowed heap space.
Then I hand the frame over to the sending routine.
Why does this routine have to copy the struct into a ring buffer? It is
already there in RAM and can be used as it is. After transmission I can
free the frame from the heap.
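
A minimal sketch of that pattern (the frame layout and the SendRequest()
helper are invented for illustration; NutHeapAlloc(), NutHeapFree() and
_write() are the existing Nut/OS calls):

#include <stdint.h>
#include <string.h>
#include <io.h>
#include <sys/heap.h>

/* Invented example frame; only the pattern matters. */
typedef struct __attribute__((packed)) {
    uint8_t stx;        /* start of frame    */
    uint8_t cmd;        /* command byte      */
    uint8_t addr;       /* device address    */
    uint8_t csh, csl;   /* checksum high/low */
    uint8_t etx;        /* end of frame      */
} REQUEST_FRAME;

/* Template kept in flash. */
static const REQUEST_FRAME req_template = { 0x02, 'I', 0x00, 0, 0, 0x03 };

int SendRequest(int fd, uint8_t addr)
{
    int rc;

    /* Borrow an area from the heap and copy the template into it. */
    REQUEST_FRAME *frm = NutHeapAlloc(sizeof(REQUEST_FRAME));
    if (frm == NULL)
        return -1;
    memcpy(frm, &req_template, sizeof(REQUEST_FRAME));

    /* Write the variable fields directly into the borrowed buffer. */
    frm->addr = addr;
    frm->csh = 0;             /* checksum calculation omitted */
    frm->csl = addr;

    /* Hand the frame over to the driver. Ideally it would transmit
     * straight from this buffer instead of copying it into a ring
     * buffer first. */
    rc = _write(fd, frm, sizeof(REQUEST_FRAME));

    NutHeapFree(frm);
    return rc;
}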

For reception I know the maximum packet size to expect. In the boot
phase I need one buffer for the biggest answer. Once I know the partner
on the other end I have to switch to the appropriate buffer size,
because I have to request the right number of bytes. Otherwise I always
have to wait for the timeout of the _read() function.
After reception of the frame I use some bytes for frame checking and
decoding, and then again I overlay a virtual struct to read the data
out. Again, there is already a buffer available, so there is no need for
a receive buffer whose data has to be copied before it appears in my
buffer.

That method saves a lot of buffer space and copying.

Together with DMA it even saves a lot of interrupt time.

>>
>> For all those functions I miss something too: If you use async
>> transfers, you will not get a valid reply on the read/write, as the
>> transfer is not finished at that time. So _ioctl needs another
>> option too. Besides getting the information about the errors from
>> the last transfer, one needs to get the status of the current
>> transfer, i.e. the number of bytes transferred and whether the
>> transfer is ongoing or was aborted for whatever reason.
>>
>
> A while ago I wrote a serial port class for my desktop apps and
> spent some time digging through the Unix and Windows APIs. On Linux,
> if you use a non-blocking transfer, read() returns EWOULDBLOCK, which
> is a define for some negative number. Much like the sockets API.
> Otherwise it returns the number of bytes read. Is that what you mean?
>
Yes, that sounds good.
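
For reference, the Linux side looks roughly like this (standard POSIX,
the device path is only an example; note that read() itself returns -1
and sets errno to EAGAIN/EWOULDBLOCK rather than returning that value):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    ssize_t n;

    int fd = open("/dev/ttyS0", O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return 1;

    n = read(fd, buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* Non-blocking: no data available right now. */
    } else if (n >= 0) {
        printf("got %zd bytes\n", n);
    }
    close(fd);
    return 0;
}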

>> So what we have is a usart that relies on the ring buffer even though
>> the ring buffer struct supplies blockptr / blockcntr. If you use
>> packet-oriented data, you cannot handle timeouts on packets, because
>> the timeouts are based on the ring buffer, and if the ring buffer is
>> too small, something blocks your thread that you cannot determine.
>
> True. Even if we implement the packet based API, we would have to
> make the serial buffer size configurable and make the call fail with
> some error code if it requests a packet larger than the buffer.
>
Normally that cannot happen. If you request a packet of n bytes, the
usart will read that much (via IRQ or DMA) and then post the wait event
to the calling function.
All characters coming in after that need to be discarded, as there is no
valid block pointer available anymore. My first idea was to disable RX,
but that can lead to a problem if someone is sending unexpected data on
the line. We need to check the RX status and clean it up before
switching to a new reception.
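
Roughly what the RX interrupt would do once a block request is pending
(the RX_BLOCK structure and all names are invented; NutEventPostFromIrq()
is the real Nut/OS call):

#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/event.h>

typedef struct {
    uint8_t *blk_ptr;   /* caller's buffer, NULL if no block requested */
    size_t   blk_cnt;   /* bytes still expected                        */
    HANDLE   rx_done;   /* event queue the reading thread waits on     */
} RX_BLOCK;

static void RxIrq(RX_BLOCK *rx, uint8_t ch)
{
    if (rx->blk_ptr == NULL || rx->blk_cnt == 0) {
        /* No valid block pointer anymore: discard the byte, or clean
         * up the RX status before the next reception is set up. */
        return;
    }
    *rx->blk_ptr++ = ch;
    if (--rx->blk_cnt == 0) {
        /* Requested number of bytes received: wake the calling thread. */
        NutEventPostFromIrq(&rx->rx_done);
    }
}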
>>
>> If you have a function that allocated memory to form a block, you
>> don't need to copy it to the ring buffer for transfer, but the
>> current implementation does. On smaller systems that need
>> packet-oriented communication (block transfer), the ring buffer
>> memory could be freed completely.
>
> This would be a nice trick. But I'm not sure how it would fit in our
> current driver structure. Somehow we would have to get the buffer
> pointer down to the driver level. Then again, I wonder if there is
> real use for the packet oriented reads.
>
There is:
I send STX I 01 CSH CSL ETX to my RS485 bus. The connected device no. 01
answers with STX i 01 ....data....lots of data.... CSH CSL ETX.
I know how much data it will send, so I can give _read(rs485, buffer,
packet_size, timeout) a packet_size that I know.

So if the packet arrives I am happy and can decode the data. If _read
returns with a timeout I can assume that someone pulled the cable and
the device is offline. I can retry for a while and then switch to device
detection again.
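
As a sketch (the frame layout, checksum and QueryDevice() are invented;
with the current API the timeout would go through the UART_SETREADTIMEOUT
ioctl rather than a fourth _read() argument):

#include <stdint.h>
#include <io.h>
#include <dev/uart.h>

typedef struct __attribute__((packed)) {
    uint8_t stx;
    uint8_t cmd;          /* 'i' in the answer         */
    uint8_t addr;
    uint8_t data[32];     /* known, fixed payload size */
    uint8_t csh, csl;
    uint8_t etx;
} REPLY_FRAME;

int QueryDevice(int rs485, uint8_t addr, REPLY_FRAME *reply)
{
    uint8_t req[6] = { 0x02, 'I', addr, 0, addr, 0x03 }; /* checksum simplified */
    uint32_t tmo = 500;                                  /* ms */
    int got;

    _ioctl(rs485, UART_SETREADTIMEOUT, &tmo);
    _write(rs485, req, sizeof(req));

    /* Request exactly the expected reply size, so the call returns as
     * soon as the last byte arrives and only runs into the timeout when
     * the device is really offline. */
    got = _read(rs485, reply, sizeof(REPLY_FRAME));
    if (got != (int) sizeof(REPLY_FRAME))
        return -1;        /* timeout or short read: retry or re-detect */

    /* STX/ETX and checksum verification would go here. */
    return 0;
}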

>>
>> On bigger systems with DMA/PDC support, you save a lot of CPU time
>> for all those TXE interrupts that do not appear.
>>
>>
>> Unfortunately I cannot implement DMA in the current structure, as DMA
>> should lock the calling thread until the transfer is over or set a
>> signal after finishing the transfer. I tried to do that by using
>> the normal StartTx(void) function that raises a TXE interrupt, and
>> this first TxReadyIrq( RINGBUF *rbf) sets up the DMA
>> process. Unfortunately this function runs outside the thread context,
>> as it is an interrupt, and therefore cannot set a NutEventWait that
>> blocks the calling thread.
>>
>
> I'm confused. Why wouldn't the calling thread keep its blocked
> state from read()?
Yes it does, but only because usart.c is blocking it. If you bypass the
usart.c functions, mainly the ring buffer functions, you bypass
NutEventWait.

I was confused too. In fact there is no blocking control via _ioctl(),
and there doesn't need to be one. You call read/write with a timeout and
decide how blocking it is. With a timeout of NUT_WAIT_INFINITE it is
very blocking :)
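
With the existing ioctl that looks roughly like this (assuming the
UART_SETREADTIMEOUT ioctl from dev/uart.h):

#include <stdint.h>
#include <io.h>
#include <dev/uart.h>
#include <sys/timer.h>

/* Make further _read() calls on fd "very blocking": wait forever for
 * the requested number of bytes. */
static void MakeReadBlocking(int fd)
{
    uint32_t tmo = NUT_WAIT_INFINITE;
    _ioctl(fd, UART_SETREADTIMEOUT, &tmo);
}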

> Anyway, I thought about using DMA first with the EMAC driver, which
> should benefit the most from it, as it transfers at least an
> ethernet frame each time.
Yes, definitely! I tried that on the AT91SAM7XE and found out that this
chip doesn't support memory-to-memory DMA. As the EMAC on the Ethernut
Radio is an external chip, DMA doesn't work there. With the SAM7X it
should work. The STM32 has lots of DMA channels, and for the large chips
there are even two independent DMA controllers with multiple channels
available.

> I see that u-boot, Linux and other kernels usually define an API for
> DMA, with dma_alloc (similar to malloc). The question is, should we
> try to do something like that, and have each arch provide the
> implementation, or should we confine the DMA engine within the arch
> folder as a private API, so each port does it as it pleases.

I already wrote a DMA_Setup() function for STM32 that detects most
things automatically. But it is a young system, so I am open for good ideas.
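
Just to have something concrete to discuss, a portable surface could
look roughly like this (purely illustrative shapes, not the actual STM32
DMA_Setup() signature; each arch would provide the implementation):

#include <stddef.h>

typedef enum { DMA_MEM2PER, DMA_PER2MEM, DMA_MEM2MEM } dma_dir_t;

typedef struct dma_channel DMA_CHANNEL;   /* opaque, defined per arch */

/* Claim a channel for a peripheral at design time; no runtime sharing. */
extern DMA_CHANNEL *DmaRequestChannel(unsigned int peripheral_id);

/* Set up and start one transfer; completion is posted via an event. */
extern int DmaSetupTransfer(DMA_CHANNEL *ch, dma_dir_t dir,
                            void *mem, volatile void *periph, size_t len);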
>
>> So here's what I am thinking about: Assume that all functions for
>> the USARTs get the USARTDCB as an argument:
>> - You can implement block control with DMA in StartTx on CPUs that
>>   support that feature.
>> - You can implement block transfer by bypassing the ring buffer (data
>>   is taken from the calling function's buffer pointer).
>> - You can write totally different usart drivers for totally different
>>   architectures while keeping the usual Nut/OS function calls.
>>
>>
>> For my STM32 implementation I expect that if I can call one set of
>> functions from all usart interrupts, the code in flash will be much
>> smaller even if I implement and enable all features. All features
>> means: HW/SW handshake, DTR/DSR support, XON/XOFF, STX/ETX,
>> full/half duplex, IrDA, ...
>>
>
> It makes a lot more sense to have all functions for the USART
> receive the DCB structure or the DEVUSART structure. That's how
> drivers in Linux and the Windows Driver Model work (sort of).
>
I didn't look at Windows or Linux, but the current layout seemed sort of
inconsistent to me anyway. And as this change opens a lot of options
while having only a very small impact, I declared it a good idea :)

>> The drawback of this change would be that all architectures have to
>> be modified to pass DEVUSART *dev or at least USARTDCB *dcb to all
>> functions. That would lead to one small problem: any function
>> accessing the ring buffer needs to derive it from the dcb. For a
>> 72MHz STM32 it's not a problem to do RINGBUF *rbf = dcb->dcb_rbf;
>> at every function start. But how is that on an AVR?
>
> That could easily be offset by any deduplication we achieve in the
> code. It should not be too much per function really.
>
Yes, that's what I thought. If I can cut the code size in half, I am
happy to pay 4 extra cycles in the interrupt routine. I need to compile
both versions on STM32 and AVR and look at the assembler output.
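
Roughly like this (the function name is invented; the ring buffer fields
are taken from the existing dev/usart.h):

#include <dev/usart.h>

static void Stm32UsartTxStart(USARTDCB *dcb)
{
    /* One extra dereference per call compared to a global ring buffer. */
    RINGBUF *rbf = &dcb->dcb_tx_rbf;

    if (rbf->rbf_cnt) {
        /* Feed the hardware from the ring buffer, or hand rbf_start and
         * rbf_cnt to a DMA channel on CPUs that support it. */
    }
}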
>
>> Ah, by the way, I am thinking about making things a bit more
>> comfortable. So one could set "Use Interrupts" and "Use DMA"
>> independently for every USART in a system. If it stays like it is,
>> so that usart1.c includes usart.c, this saves some flash if the user
>> unchecks one or the other option. If there is only one usart.c
>> called by the interrupts of usartx.c, it could be an idea to include
>> portions of the code only if at least one of the usarts has enabled
>> that option. So DMA handling in the general driver is only enabled
>> and compiled if at least one usart has the option set in nutconf.
>>
>
> Wouldn't it actually make the code bigger? Some routines would be
> duplicated in the binary blob, one with DMA and another without. I'm
> not sure if there is a use case where one would like to enable DMA
> for one USART but not for the others.

That's how it is right now!
usart1.c defines lots of #defines and includes usart.c. Then usart2.c
redefines the #defines and includes usart.c again. So in fact, without
tricky compiler optimizations, you get the full code per UART.
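
The per-UART DMA option from above could then be compiled in only when
someone actually uses it, roughly like this (the configuration macro
names are invented):

#include <dev/usart.h>

#if defined(UART1_USE_DMA) || defined(UART2_USE_DMA) || defined(UART3_USE_DMA)
#define USART_ANY_DMA
#endif

#ifdef USART_ANY_DMA
/* Shared DMA transmit setup: present in flash only once, and only if at
 * least one UART has the option set in nutconf. */
static void UsartDmaStartTx(USARTDCB *dcb)
{
    (void) dcb;   /* arch-specific DMA channel setup would go here */
}
#endif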

I need to do some tests for that. DMA in these CPUs is not an unlimited
resource. The Cortex architecture is great and the DMA normally does not
interfere with CPU bus access. But if you set up several transfers you
get a performance impact. So the user might like to decide which device
uses the DMA and which might be delayed. For serial ports it hardly
matters, but if your audio stream breaks up because you are using a
serial port with DMA...

Ah, before I forget! The DMA channels are shared. So there are channels
that can be used for I2C and USART, but not both at once. So either you
have to wait for the current transfer to finish before taking over the
channel, or you have to decide by design. For Nut/OS I'd prefer the
latter, as it keeps things simple and controllable. If you assign the
channel to TWI it is not available for USART anymore. Simple and
effective.

>
>> So now I have three options: 1 Modify usart.c / usart.h / uart.h to
>> the new structure and hope that someone helps me pull the AVR
>> and ARM architectures up to that level.
>
> I can help with AVR and AVR32.
>
>> 2 Just split usart.h / uart.h into stm32_usart.h and other_usart.h
>> while usart.h includes the one or the other depending on the
>> architecture selected.
>
> It can easily become a nightmare with regard to maintenance and
> portability.
>
>> 3 Leave it as it is and forget about that all :)
>
> Tempting *smile* Actually I think Nut/OS already has the most
> comprehensive USART driver of the RTOSes I know of, and for the
> applications we work with, that's a huge benefit :) But it's also
> quite hard to maintain the way it is... If a bug is found in the flow
> control code, for instance, one has to remember to fix it in all other
> archs, and it only gets worse with new platforms being added.
>
I fully agree!
But how to manage it? There are only two ways:
The way it is now means that if you change one thing in one architecture,
the code doesn't compile for the other architectures until you put the
fix into all of them.
The other way would be to define a minimum specification that every
driver / architecture has to fulfill, but may extend.
>>
>> By the way, option 2 is what I did for TWI, because STM32 has two
>> interfaces and 4 interrupt providers (two per interface) that call
>> the same code, which exists only once. The old Tw*() functions are
>> #defined to the stm32-specific functions. Works fine here :)
>>
>
> Yesterday I was thinking about a platform independent TWI. So we
> could have platform independent drivers to access EEPROMs and Atmel
> QTouch chips.
>
Oh, sorry, but now you make me a bit sad... I already wrote that and it
is in the trunk. It is the at24c.c driver.
With the STM port it works fine. I did some fixes and extensions, so you
can now configure your EEPROM in nutconf and just call EE_Init().
There you are. It supports paging and NACK polling, and it also survives
interleaved bus access without breaking.
So I started some LED and key threads for LEDs and keys connected to an
I2C expander. Then I took lots of different strings larger than 32 bytes
and fired them at the EEPROM while pressing keys and having the LEDs
blink. All over one 400 kbit/s I2C bus. Worked fine with AT91SAM7X and
STM32F10x.

What is new for the STM32 is that I introduced TWI as a bus for Nut/OS.
So you officially need to do NutRegisterTwiBus(devTwiBus1...).
Then you can register nodes on the bus, or call read/write functions and
hand the device over to them.
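
A rough usage sketch (the bus registration arguments are omitted as
above, and EE_Write()/EE_Read() are only placeholders, not the verified
at24c.c API):

#include <stddef.h>
#include <stdint.h>
#include <dev/twif.h>

/* Placeholder prototypes for illustration only. */
extern int EE_Init(void);
extern int EE_Write(uint16_t addr, const void *buf, size_t len);
extern int EE_Read(uint16_t addr, void *buf, size_t len);

int main(void)
{
    static uint8_t msg[40] = "a string larger than one 32 byte page";

    /* NutRegisterTwiBus(devTwiBus1, ...);  register the bus first */

    EE_Init();                            /* geometry comes from nutconf */

    EE_Write(0x0000, msg, sizeof(msg));   /* paging handled by the driver */
    EE_Read(0x0000, msg, sizeof(msg));
    return 0;
}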

> But I'm actually quite worried about the GPIO. I'm going to start
> working on a board with UC3B164 connected with sensors/relays. I
> would like to see and use an interface to set pin functions, level,
> configure interrupts, etc in a way that's standard and portable
> between current and future platforms. How are you handling this with
> STM32? Btw, which STM32 are you using? I would like to take a look at
> the datasheets :)
>
I am using
STM32F103RB (64 Pin, 64k Flash, 20k RAM)
STM32F103ZE (144 Pin, 512k Flash, 64k RAM)
STM32F107VC (144 Pin, 256k Flash, 64k RAM, MII)

For the GPIO I have the following idea:
Let's write down a specification of "must have" options for GPIO. I
already started this in the whitepaper section of the wiki.

So every CPU has to support
PIN_CFG_OUTPUT
PIN_CFG_INPUT
PIN_CFG_MULTIDRIVE
PIN_CFG_DISABLE
PIN_CFG_PERIPHAL
Maybe I forgot something, but it's just an example.

After defining this set of options we define a basic set of functions:

GpioPinConfigSet()
GpioPinConfigGet()
GpioPortConfigSet()
GpioPortConfigGet()

These are defined in gpio.h.
But gpio.h also includes an architecture-specific stm32_gpio.h.
This specific file adds what the architecture puts on top.

The basic set is required to keep all example code running on all
supported platforms. The optional set of defines is what you need for 
your special application.

For the STM32 I added PIN_CFG_INPUT, PIN_CFG_ANALOG....
The STM32 supports some energy saving by letting you decide how fast to
drive the pin. So I mapped PIN_CFG_OUTPUT to the 50MHz option. But I can
now put PIN_CFG_OUTPUT50, PIN_CFG_OUTPUT10 and PIN_CFG_OUTPUT2 on top
without breaking the dependencies for the examples.
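
Application code could then look like this (PIN_CFG_* and the function
names are the ones proposed above; NUTGPIO_PORTB and the bank/bit
calling convention are assumptions):

#include <dev/gpio.h>

void BoardPinInit(void)
{
    /* Mandatory options, available on every architecture: */
    GpioPinConfigSet(NUTGPIO_PORTB, 5, PIN_CFG_OUTPUT);
    GpioPinConfigSet(NUTGPIO_PORTB, 6, PIN_CFG_INPUT);

    /* Architecture-specific extras from stm32_gpio.h, guarded so the
     * same file still builds on ports that do not provide them: */
#ifdef PIN_CFG_OUTPUT2
    GpioPinConfigSet(NUTGPIO_PORTB, 7, PIN_CFG_OUTPUT2);  /* slow, low EMI */
#endif
}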

For the STM32 I have one additional problem. You do not switch pins to a
secondary or third function; you switch the function to the pins... And
you need to enable the clock supply for everything in the chip. I still
have to figure out how to implement that best in Nut/OS.

Ah, we definitely need to switch the defaults for a pin. I would prefer
PIN_CFG_OUTPUT to configure a pin as an open-drain output, and instead
of PIN_CFG_MULTIDRIVE you would define PIN_CFG_PUSHPULL if you want
push-pull. This would prevent boards from being killed.

So what do you think about splitting the GPIO drivers into
architecture-specific parts while keeping a mandatory set of options and
functions?

It has an additional benefit:
For AVR you define only 8-bit wide options and port/pin numbers.
For larger architectures you can use 16 or 32 bits.
This would speed up things a lot, especially for 8-bit systems.

So, that's enough for one mail in the list :)

Best regards
Ulrich


