I/O error

1. Introduction

It first started when I had a look at my SendMail log files where I saw a lot of 'collect: I/O error' messages... It was always for the same remote host and for large mails (size >800Kb).

I first tried to look around the SendMail FAQ but found nothing interesting... (http://www.sendmail.org/faq/section3.html#3.10). Then I took the famous Bat book and start playing with timeout values... nothing did. Finally, I strongly believe it has something to do with the TCP/IP layer rather with SendMail itself.

I used to post several messages at different places to get feedback from others and decided to build this web page to gather all the information I could collect on the subject. As the subject is currently under discussion, this page contains the various posting I did and the answers I received. When a solution will be found, I'll make a summary of the entire discussion...

 

2. Postings

From: Bertrand Renuart
To: comp.mail.sendmail
Date: Wed, 23 Dec 1998 13:33:45 +0100
Subject: collect: I/O error

Hi there,

Hope somebody could help me with this problem.
I'm using sendmail 8.8.7 on a Linux Redhat 5.2 system connected to the net through a 256Kbits connection.
Everything goes ok except for some hosts when they try to send 'large' mails (size above 800K).

Here is what I have in my log files...

-----------------
Dec 20 04:28:56 itma sendmail[7208]: EAA07208: collect: premature EOM: Connection reset by miraculix2.biac.be
Dec 20 04:28:56 itma sendmail[7208]: EAA07208: SYSERR(root): collect: I/O error on connection from ... Connection reset by miraculix2.biac.be
Dec 20 04:28:56 itma sendmail[7208]: EAA07208: from=<...>, size=763054, class=0, pri=0, nrcpts=1, proto=SMTP, relay=miraculix2.biac.be [195.0.74.34]
-----------------

By the way, the load of the system is about 0 all the time... I mean that compute isn't overloaded and is powerfull enough to handle the tasks.

Question:
- is it a sendmail configuration issue (like timeouts value) ?
- is it a network problem ?

Thanks for your help.


From: Bertrand Renuart
To: comp.mail.sendmail
Date: Thu, 24 Dec 1998 00:57:04 +0100
Subject: RE: collect: I/O error

Here are more informations about my problem.

My computer is called itma.interweb.be The following logs show a SMTP session between my host and another one called santaclara01.pop.internex.net.
Sorry for you guys, but I had to give your 'name' as information... anyway don't think the problem is on your side...

My god damned host is connected to the net throught a 256Kbits leased line. The average load on this link is about 40% (probably less during the session shown below).

 

1. Traceroute to the sending side.

Nothing exceptional, pings go through with acceptable delays (except maybe the 14th line)

-----------------
traceroute to 205.158.3.18 (205.158.3.18), 30 hops max, 40 byte packets
1 cs3640-2-e1 (194.78.101.254) 2.306 ms 1.934 ms 2.150 ms
2 cs3640-1-e0 (194.78.71.1) 11.584 ms 11.708 ms 13.902 ms
3 cs7206-1-fe1-0.brussels.interweb.be (194.78.253.2) 42.085 ms 39.381 ms 19.767 ms
4 cs7206-1-fe1-1.brussels.interweb.be (194.78.16.193) 37.920 ms 21.548 ms 24.395 ms
5 cs7206-1-fe1-1.brussels.interweb.be (194.78.16.193) 24.816 ms 23.621 ms 64.073 ms
6 str01-fe2-1-0.belbone.net (194.78.255.198) 114.994 ms 27.770 ms 41.435 ms
7 borderx1-hssi4-0.NewYork.cw.net (204.70.219.109) 232.981 ms 220.637 ms 242.075 ms
8 core1-fddi-0.NewYork.cw.net (204.70.2.17) 149.422 ms 361.792 ms 180.347 ms
9 core1-hssi-2.Washington.cw.net (204.70.1.5) 202.175 ms 169.919 ms 176.662 ms
10 mae-east2-nap.Washington.cw.net (204.70.1.222) 206.580 ms 168.842 ms 242.020 ms
11 mae-east.px.concentric.net (192.41.177.70) 150.785 ms 187.477 ms 208.879 ms
12 us-hq-core1-a3-0-0d11.rtr.concentric.net (207.88.0.73) 287.289 ms 243.994 ms 321.266 ms
13 us-hq-access1-f0-0.rtr.concentric.net (205.158.0.2) 323.791 ms 250.974 ms 245.100 ms
14 santaclara01.pop.internex.net (205.158.3.18) 285.260 ms * 361.418 ms
-----------------

2. Sendmail log output.

Here is what I got in my sendmail log (log level turned to 15)

-----------------
Dec 24 00:13:10 itma sendmail[10029]: SMTP connect from santaclara01.pop.internex.net [205.158.3.18] (205.158.3.18)
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 220 itma.interweb.be ESMTP Sendmail 8.8.7/8.8.7; Thu, 24 Dec 1998 00:13:10 +0100
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: <-- EHLO santaclara01.pop.internex.net
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-itma.interweb.be Hello santaclara01.pop.internex.net [205.158.3.18], pleased to meet you
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-EXPN
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-VERB
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-8BITMIME
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-SIZE
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-DSN
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-ONEX
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-ETRN
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250-XUSR
Dec 24 00:13:10 itma sendmail[10029]: NOQUEUE: --> 250 HELP
Dec 24 00:13:11 itma sendmail[10029]: NOQUEUE: <-- MAIL FROM:<....>SIZE=2649370
Dec 24 00:13:11 itma sendmail[10030]: AAA10030: --> 250 <....>... Sender ok
Dec 24 00:13:11 itma sendmail[10030]: AAA10030: <-- RCPT TO:<....>
Dec 24 00:13:11 itma sendmail[10030]: AAA10030: --> 250 <....>... Recipient ok
Dec 24 00:13:12 itma sendmail[10030]: AAA10030: <-- DATA 
Dec 24 00:13:12 itma sendmail[10030]: AAA10030: --> 354 Enter mail, end with "." on a line by itself
Dec 24 00:30:36 itma sendmail[10030]: AAA10030: collect: premature EOM: Connection reset by santaclara01.pop.internex.net
Dec 24 00:30:36 itma sendmail[10030]: AAA10030: --> 451 collect: I/O error on connection from santaclara01.pop.internex.net, from=<....>: Connection reset by santaclara01.pop.internex.net (held)
Dec 24 00:30:36 itma sendmail[10030]: AAA10030: SYSERR(root): collect: I/O error on connection from santaclara01.pop.internex.net, from=<....>: Connection reset by santaclara01.pop.internex.net
Dec 24 00:30:36 itma sendmail[10030]: AAA10030: from=<....>, size=1585487, class=0, pri=0, nrcpts=1, proto=ESMTP, relay=santaclara01.pop.internex.net [205.158.3.18]
Dec 24 00:30:36 itma sendmail[10030]: AAA10030: done
-----------------

During the session, I had a look at the mqueue directory. The file kept growing up to 1.5mb in about 2 minutes and then nothing until the connection is broken. I have plenty of disk space left (about 1Gb)

 

3. Processes and Load average

A look at the processes:

-----------------
root 10029 0.0 3.1 1408 976 ? S 00:13 0:00 sendmail: server santaclara01.pop.internex.net [205.158.3.18] child wait 
root 10030 2.3 3.3 1468 1048 ? S 00:13 0:06 sendmail: AAA10030 santaclara01.pop.internex.net [205.158.3.18]: DATA 
-----------------

Question: is this normal ? Two processes that handle one incoming request ?
Anyway, the load was near 0 and 45 processes were in memory.
The mail queue was empty, no other sendmail were running except the SMTP listener.

Well, I hope some could help me...
Cheers

Bertrand


From: Bertrand Renuart
To: comp.mail.sendmail, com.os.linux.networking, comp.os.linux.misc
Date: Thu, 24 Dec 1998 13:11:14 +0100
Subject: RE: collect: I/O error

I currently have some strange problems which I believe are TCP/IP related.
It comes from SendMail when receiving large mails (>800K) from some hosts where I got the 'collect: I/O error' error message.
It seems that a lot of people used to have the same problem, but I couldn't find any solution in the newsgroups...

My investigations make me believe it is not a SendMail problem but is more a TCP/IP related issue.
At least, it has nothing to do with SendMail's timeout values. As I said before, incoming mails are queued on my host at relatively high speed (1.6 Mb in 2 minutes). Then it is frozen until a given timeout (either from SendMail or TCP/IP layer) close the connection after 15min. During this time, nothing happens...

So, I did a tcpdump of the traffic from the sending host. The last 15min are shown below (hope the output is readable). Lines have been numbered for later comment.
I replaced the sender's IP address with 'sender' (205.158.3.18:33037)
My host is 'me' (194.78.101.131:smtp)

---------
(01) 01:12:31.590000 sender > me: . 1648691:1650151(1460) ack 408 win 8760 (DF)
(02) 01:12:31.600000 me > sender: . ack 1650151 win 30660 (DF)
(03) 01:12:31.630000 sender > me: . 1650151:1651375(1224) ack 408 win 8760 (DF)
(04) 01:12:31.680000 sender > me: . 1651375:1652835(1460) ack 408 win 8760 (DF)
(05) 01:12:31.730000 me > sender: . ack 1652835 win 32120 (DF)
(06) 01:12:31.810000 sender > me: P 1654295:1655519(1224) ack 408 win 8760 (DF)
(07) 01:12:31.810000 me > sender: . ack 1652835 win 32120 (DF)
(08) 01:12:31.910000 sender > me: . 1655519:1656979(1460) ack 408 win 8760 (DF)
(09) 01:12:31.910000 me > sender: . ack 1652835 win 32120 (DF)
(10) 01:12:32.030000 sender > me: . 1656979:1658439(1460) ack 408 win 8760 (DF)
(11) 01:12:32.030000 me > sender: . ack 1652835 win 32120 (DF)

 ... (same sequence) ...

(12) 01:12:32.840000 sender > me: . 1677901:1679125(1224) ack 408 win 8760 (DF)
(13) 01:12:32.840000 me > sender: . ack 1652835 win 32120 (DF)
(14) 01:12:32.890000 sender > me: . 1679125:1680585(1460) ack 408 win 8760 (DF)
(15) 01:12:32.890000 me > sender: . ack 1652835 win 32120 (DF)
(16) 01:12:32.940000 sender > me: . 1650151:1651611(1460) ack 408 win 8760 (DF)
(17) 01:12:32.940000 me > sender: . ack 1652835 win 32120 (DF)
(18) 01:27:32.890000 me > sender: . ack 1652835 win 32120
(19) 01:27:33.100000 sender > me: R 3960486748:3960486748(0) win 0 (DF)
---------

What can we learn from this ?

(01)-(05) Paquets are sent and acknowledge without any problem up to the offset 1652835
(06) Something strange: the next paquet received from the sender is at offset 1654295. So, I guess we lost the previous paquet that should be within 1652835 and 1654295.
Question: what this P (Push) flag does mean ?
(07) Result: my host keeps sending acknowledge for the last data received in (04). Normal behavior as far as I understand the principle.
(08)-(15) Data keep coming from the sender for the next paquets (offset is increasing). I guess it is normal since everything happened in the same second... the (07) acknowledge didn't probably reach the destination yet.
In the meantime, my host continues to send the same acknowledge as in (07) telling the remote side we lost a paquet somewhere.
(16) The sender seems to be aware of the problem, he probably received one of the acknowledges. I guess the (05) didn't come through, instead the sender is reacting based on the (02) since it starts resending data starting at that point. Anyway, if it had sent more data, it would have violate my TCP/IP window of 32120.
Question: TCP/IP windows are different on both hosts.. normal?
(17) Well, I receive the re-sent data and send back an ack. But since this offset has already been received successfuly, I keep sending the same ack as before.
There everybody is confused: the sender re-send data from 1650151 to 1651611 because he thinks that they didn't come through and I acknowledge further data (1652835)...
Question: does the sender agree with an ack for data he thinks I never received?
(18) I don't know why, but nothing more is coming from the sender. I guess it is totally confused by the previous ack... 15 minutes later, I re-send the same ack but without the (DF) flag (DF -> data fragment?) Immediately, the sender reset the connection !

What the hell is this ?
Looking at this, I believe it is more a TCP/IP issue than a SendMail problem...
Does someone understand this behavior ?

Some technical information:
'Me' is an intel based Linux system (Redhat 5.2 - kernel 2.0.36), running Sendmail 8.8.7
I don't have a clue about the 'Sender' configuration. I'll ask them...

Feedback is welcome...
Bertrand


Last update: 24 Dec 1998 bertrand dot renuart at itma.lu
This link is there just to drive google to this very cool web application