Reading content from a document in C#

- Content last updated: 2015-01-06
Question:

I read the data from a .docx file using a StreamReader, got the content as a string, and printed it using Console.WriteLine. This content is not the same as the content I got using the File.ReadAllBytes function for the same file.

The code is shown below.

// first code

StreamReader streamReader = new StreamReader(@"D:\sample.docx");
String text = streamReader.ReadToEnd();
Console.WriteLine(streamReader.CurrentEncoding); // it shows the encoding as UTF8
byte[] array = Encoding.UTF8.GetBytes(text);
File.WriteAllBytes(@"D:\file3.txt", array);

This is the output I got from the above code:

PK     ! ߤ�lZ      [Content_Types].xml �(�                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ���n�0E�����Ub袪*�>�-R�{V��Ǽ��QU�
l"%3��3Vƃ�ښl    �w%�=���^i7+���-d&�0�A�6�l4��L60#�Ò�S
O����X� �*��V$z�3��3������%p)O�^����5}nH"d�s�Xg�L�`���|�ԟ�|�P�rۃs�?�PW��tt4Q+��"�wa���|Ty���,N���U�%���-D/��ܚ��X�ݞ�(���<E��)�� ;�N�L?�F�˼��܉��<Fk� �h�y����ڜ���q�i��?�ޯl��i� 1��]�H�g��m�@����m�  �� PK     ! ���   N   _rels/.rels �(�                    

// second code

byte[] x = File.ReadAllBytes(@"D:\sample.docx");
File.WriteAllBytes(@"C:\file3.txt", x);

The contents of the two files are different. Is there any way for my first code to produce the same content as the second code?

This is the output I got when I used ReadAllBytes:

PK     ! ߤÒlZ      [Content_Types].xml ¢(                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  ´”ËnÂ0E÷•ú‘·Ub袪*‹>–-Ré{Vý’Ǽþ¾QU‘
l"%3÷Þ3VƃÑÚšl  µw%ë=–“^i7+Ù×ä-d&á”0ÞAÉ6€l4¼½L60#µÃ’ÍS
Oœ£œƒXø Ž*•V$z3„ü3à÷½Þ—Þ%p)Oµ^ “²×5}nH"dÙsÓXg•L„`´‰ê|éÔŸ”|—PrÛƒsðŽ?˜PWŽìtt4Q+ÈÆ"¦wa©‹¯|Ty¹°¤,NÛàôU¥%´úÚ-D/‘ÎÜš¢­X¡Ýžÿ(¦¼<EãÛ)‘à ;çN„L?¯Fñ˼¤¢Ü‰˜¸<FkÝ  ‘h¡yöÏæØÚœŠ¤Îqôi£ã?ÆÞ¯l­Îià 1éÓ]›HÖgÏõm @ÈæÛûmø  ÿÿ PK     ! ‘·ï   N   _rels/.rels ¢(          
user1666620 replied: What do you mean by the content being different? Are you talking about fonts etc.?

Selman22 replied: It's not a text file. If you want to read a Word document, use Interop or one of the libraries that exist for that.
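As an illustration of the "libraries" route that does not need Word installed: a .docx file is a ZIP package (the PK signature and the [Content_Types].xml entry are visible in the output above), and the body text normally lives in the word/document.xml entry. The following is only a minimal sketch under those assumptions, using System.IO.Compression and LINQ to XML; the class name DocxText and the method ReadParagraphs are made up for this example, and it ignores tables, headers, footnotes, and other parts of the package.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxText
{
    // WordprocessingML main namespace used by the w:p and w:t elements.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: returns the text of each non-empty paragraph.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var archive = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml);
                // Each w:p is a paragraph; its text is the concatenation
                // of the w:t elements inside it.
                return doc.Descendants(W + "p")
                          .Select(p => string.Concat(
                              p.Descendants(W + "t").Select(t => t.Value)))
                          .Where(s => s.Trim().Length > 0)
                          .ToList();
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: DocxText <path to .docx>");
            return;
        }
        using (FileStream file = File.OpenRead(args[0]))
        {
            foreach (string paragraph in ReadParagraphs(file))
                Console.WriteLine(paragraph);
        }
    }
}
```

For anything beyond plain paragraph text, a dedicated library or Interop is probably the safer choice.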

wazza replied: When I read it as bytes I got binary data, but when I read it as a string, some of the data is shown as ???

helb replied: What are you trying to do? Why are you not just using File.Copy()? Also, try googling for "encoding".
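On the "encoding" hint: if the goal really is to push the bytes through a string and get the same bytes back, one possible workaround (not something suggested in this thread) is a single-byte encoding such as ISO-8859-1, which maps every byte value 0x00 to 0xFF to the character with the same code point. The sketch below assumes that approach; the class and method names are invented for the example, and the resulting string is still not meaningful text.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class Latin1RoundTrip
{
    // ISO-8859-1 maps each byte to the code point with the same value,
    // so decoding and re-encoding does not lose information.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Hypothetical helper: reads bytes through a StreamReader and
    // converts the resulting string back to bytes.
    public static byte[] ThroughStreamReader(byte[] input)
    {
        using (var reader = new StreamReader(new MemoryStream(input), Latin1, false))
        {
            string text = reader.ReadToEnd();
            return Latin1.GetBytes(text);
        }
    }

    public static void Main()
    {
        // Every possible byte value, as a stand-in for binary file content.
        byte[] sample = Enumerable.Range(0, 256).Select(i => (byte)i).ToArray();
        byte[] result = ThroughStreamReader(sample);
        Console.WriteLine(sample.SequenceEqual(result));
    }
}
```

File.ReadAllBytes or File.Copy remains the simpler option when no text processing is needed.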

NotJarvis replied: The difference between the two sections of code is fundamental. In the first one you are using a StreamReader, which by design converts the data it reads into a C# string. Unfortunately for you, the data in a .docx file is not string data, so you are converting binary data to a string and later trying to convert it back to bytes (i.e. binary data). If you wish to read the data into a stream and write it out again, use BinaryReader instead.
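To make that point concrete, here is a small sketch comparing the two paths on an in-memory sample. The sample bytes are made up for illustration (a ZIP signature followed by two bytes that are not valid UTF-8), and the class and method names are hypothetical. Invalid byte sequences are replaced with U+FFFD when decoded as UTF-8, which is where the extra question marks come from.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decodes the bytes as UTF-8 text and encodes the text back to bytes,
    // mirroring what the StreamReader-based code in the question does.
    public static byte[] ThroughString(byte[] input)
    {
        string text = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(text);
    }

    // Copies the bytes with BinaryReader and BinaryWriter; no text decoding.
    public static byte[] ThroughBinaryReader(byte[] input)
    {
        using (var source = new MemoryStream(input))
        using (var reader = new BinaryReader(source))
        using (var target = new MemoryStream())
        using (var writer = new BinaryWriter(target))
        {
            writer.Write(reader.ReadBytes(input.Length));
            writer.Flush();
            return target.ToArray();
        }
    }

    public static void Main()
    {
        // Made-up sample: "PK", 0x03, 0x04, then bytes that are invalid UTF-8.
        byte[] sample = { 0x50, 0x4B, 0x03, 0x04, 0xD2, 0xFF };

        Console.WriteLine("string round trip identical: " +
            sample.SequenceEqual(ThroughString(sample)));
        Console.WriteLine("binary copy identical: " +
            sample.SequenceEqual(ThroughBinaryReader(sample)));
    }
}
```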

Solution:

To read data from Word you should use Microsoft Word Interop.

The example below shows how to read data from a Word document.

Add a reference to Microsoft.Office.Interop.Word.

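using System;
using System.Collections.Generic;
using Microsoft.Office.Interop.Word;

Application application = new Application();

// Open a doc file (verbatim string, so the backslash is not treated as an escape).
Document document = application.Documents.Open(@"D:\Test.docx");

List<string> data = new List<string>();

// Word collections are 1-based, so iterate from 1 to Count inclusive.
for (int i = 1; i <= document.Paragraphs.Count; i++)
{
    string temp = document.Paragraphs[i].Range.Text.Trim();
    if (temp != string.Empty)
        data.Add(temp);
}

foreach (var item in data)
{
    Console.WriteLine(item);
}

// Close the document and quit Word.
document.Close();
application.Quit();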