Compare two VXML documents against each other [on hold] - java

Can anybody suggest how to compare two VXML documents against each other and list the detailed differences?
I used XMLUnit (Java) for this, but it was not effective because its logging of the differences was not good enough.
I also tried plain XML comparison, but a VXML document is structured quite differently from a generic XML document.
Could anyone suggest a Java library or open-source tool that can help me with this?


Is there a solution to parse wikipedia xml dump file in Java?

I am trying to parse this huge (25 GB+) Wikipedia XML dump file. Any solution that will help would be appreciated, preferably one in Java.
A Java API to parse Wikipedia XML dumps: WikiXMLJ (last updated November 2010).
Also, there is a live mirror that is Maven-compatible and includes some bug fixes.
Of course it's possible to parse huge XML files with Java, but you should use the right kind of XML parser: for example a SAX parser, which processes the data element by element, rather than a DOM parser, which tries to load the whole document into memory.
It's impossible to give you a complete solution because your question is very general and superficial: what exactly do you want to do with the data?
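To make the SAX suggestion concrete, here is a minimal sketch using only the JDK. The `<title>` element name mirrors the Wikipedia dump schema, but treat the details as illustrative rather than a drop-in solution; for a real dump you would parse from a (decompressing) file stream instead of a string:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTitleExtractor {

    /** Streams the XML and collects the text of every <title> element. */
    public static List<String> extractTitles(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inTitle = false;
            private final StringBuilder buf = new StringBuilder();

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("title")) { inTitle = true; buf.setLength(0); }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("title")) { inTitle = false; titles.add(buf.toString()); }
            }
        };
        // The parser pushes events to the handler; memory use stays constant
        // no matter how large the input is.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<mediawiki><page><title>Java</title></page>"
                   + "<page><title>XML</title></page></mediawiki>";
        System.out.println(extractTitles(xml));
    }
}
```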
Here is an active Java project that can be used to parse Wikipedia XML dump files: There are many examples of Java programs that transform Wikipedia XML content into HTML, PDF, text, and so on:
Yep, right: do not use DOM. If you only want to read a small amount of data and store it in your own POJOs, you can also use an XSLT transformation: transform the data into an XML format that is then converted to POJOs using Castor/JAXB (XML-to-object libraries).
Please share how you solve the problem so others can learn from your approach.
--- Edit ---
Check the links below for a better comparison of the different parsers. StAX seems better because it gives you control over the parser: it pulls data from the input only when needed.
If you don't intend to write or change anything in that XML, consider using SAX. It keeps one node at a time in memory (unlike DOM, which tries to build the whole tree in memory).
I would go with StAX, as it provides more flexibility than SAX (which is also a good option).
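For comparison, here is a pull-based StAX sketch of the same streaming idea, again using only the JDK; `title` is just an illustrative element name, and a real run would read from a file stream rather than a string:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxTitleReader {

    /** Pulls events from the stream, extracting the text of each <title>. */
    public static List<String> readTitles(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            // Unlike SAX, the application decides when to advance the cursor.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("title")) {
                // getElementText() reads up to the matching end tag
                titles.add(reader.getElementText());
            }
        }
        reader.close();
        return titles;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readTitles("<wiki><title>StAX</title><title>SAX</title></wiki>"));
    }
}
```

The flexibility mentioned above is visible in the loop: the caller can skip, stop early, or hand the cursor to another routine, which is awkward to do with SAX's push-style callbacks.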
There is a standalone application that parses Wikipedia dumps into XML and plain text, called Wiki Parser.
In principle, you can parse the Wikipedia dump and then use Java to do anything you need with the XML or plain text.
The advantage of doing it that way is that Wiki Parser is very fast and takes only 2-3 hours to parse all current English Wikipedia articles.
I had this problem a few days ago, and I found that the wiki parser provided by does the job.
It streams the XML file and reads it in chunks, which you can then capture in callbacks.
This is a snippet of how I used it in Scala:
val parser = new XMLDumpParser(new BZip2CompressorInputStream(new BufferedInputStream(new FileInputStream(pathToWikipediaDump)), true))
parser.getContentHandler.setRevisionCallback(new RevisionCallback {
  override def callback(revision: Revision): Unit = {
    val page = revision.getPage
    val title = page.getTitle
    val articleText = revision.getText()
    println(s"$title\n$articleText")
  }
})
It streams the Wikipedia dump, parses it, and every time it finds a revision (article) it gets the title and text and prints the article's text. :)
--- Edit ---
Currently I am working on which I think covers part of the pipeline you might need.
Feel free to take a look at the code.

XML Document Parsing in Java [closed]

I need to parse an XML document in Java for a web service I'm making, and save its contents.
I need to save the names of the tags, the attributes (if a tag has any), and the data within those tags. These three items will be inserted into a database table with three columns: tags, attributes, and data.
I'm using the following java libraries:
org.w3c.dom.Document, org.w3c.dom.NodeList
Any help would be much appreciated.
DISCLAIMER: I don't want to plagiarize so I didn't include code but included links to other tutorials that are VERY helpful to this topic.
First, you should read the W3C DOM Java API, because it documents many methods directly relevant to your question.
Second, this website contains a tutorial that's easy to understand and has the information you need to get the attributes of tags.
Third, this website shows how to get the tag name when you are looping through elements.
Fourth, you should always read the related API docs and Google first, and only post a question if you still have no clue after a long period of time.
Lastly, you should post a different question, or research databases first, before asking about that here. This question should only be about XML document parsing in Java.
We are not supposed to do the work for you, so the API (and Google) is your best help.
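Since the tutorials linked above may disappear, here is a hedged sketch of the kind of traversal they describe, using only `org.w3c.dom`. It walks the tree recursively and emits one "tag|attributes|data" string per element, which could then be mapped onto the three database columns; the pipe-separated format is my own invention for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomToRows {

    /** Walks the DOM and emits one "tag|attributes|data" string per element. */
    public static List<String> toRows(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> rows = new ArrayList<>();
        walk(doc.getDocumentElement(), rows);
        return rows;
    }

    private static void walk(Element el, List<String> rows) {
        // Gather the element's attributes as "name=value" pairs.
        StringBuilder attrs = new StringBuilder();
        NamedNodeMap map = el.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
            Node a = map.item(i);
            if (attrs.length() > 0) attrs.append(' ');
            attrs.append(a.getNodeName()).append('=').append(a.getNodeValue());
        }
        // Collect only this element's direct text, recursing into child elements.
        StringBuilder text = new StringBuilder();
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) text.append(child.getNodeValue().trim());
            if (child.getNodeType() == Node.ELEMENT_NODE) walk((Element) child, rows);
        }
        rows.add(el.getTagName() + "|" + attrs + "|" + text);
    }

    public static void main(String[] args) throws Exception {
        toRows("<book id=\"1\"><title>XML</title></book>").forEach(System.out::println);
    }
}
```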

Best java library for creating, parsing and querying XML [closed]

I am creating a system that will store C++ code in an XML file, so I need to be able to create the XML file from either a string or an array of strings, and then parse and query it later in the program. What is the best library to do this with? I've been looking at XPath for querying and Simple for creating the document, although there doesn't seem to be much helpful documentation on it.
Many thanks
Use VTD-XML; it works well with XML and is much faster, and it deals well with XPath. I have used it before. For more info, please visit
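Before reaching for a third-party library, note that the JDK alone covers the create-parse-query cycle the question describes. Here is a hedged sketch (the `snippets`/`snippet` element names and the `lang` attribute are made up for illustration):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CodeStore {

    /** Parses XML built from a string and evaluates an XPath query against it. */
    public static String query(String xml, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // evaluate() with no explicit return type gives the string value
        // of the first matching node.
        return xpath.evaluate(xpathExpr, doc);
    }

    public static void main(String[] args) throws Exception {
        // Build the document from plain strings, as the question describes.
        String xml = "<snippets>"
                   + "<snippet lang=\"cpp\">int main() { return 0; }</snippet>"
                   + "</snippets>";
        System.out.println(query(xml, "/snippets/snippet[@lang='cpp']"));
    }
}
```

Storing C++ source as element text is fine as long as the writer escapes `<`, `>`, and `&` (any serializer or `CDATA` section will do this for you).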

Best way to extract text (e.g. articles) from web page [closed]

So I am trying to write a program which can collect certain information from different articles and combine them. The step in which I am having trouble is extracting the article from the web page.
I was wondering whether you could suggest any Java libraries/methods for extracting text from a web page?
I have also found this product:
and was wondering whether you think this is the way to go. If so, can someone point me to a Java implementation? I cannot seem to find one, although apparently it exists.
Many thanks
Clarification: I am looking for an algorithm/library/method for detecting where in an HTML DOM tree a block of text that could be an article is located, like Safari's Reader function.
P.S. If you think this is much easier done in something like Python, just say so. My program has to run in Java (it should eventually run on a server using a Java framework), but I could have it make use of Python scripts; I would only do this if you advise that Python is the way to go.
Have a look at Apache Tika. It's meant to be used together with a crawler and can extract both text and metadata for you. You can also select various output types.
I have found an open-source solution which was extremely highly rated.
A review of different text-extraction algorithms:
It appears that Diffbot performs very well but is not open source, so in terms of open source, Boilerpipe could be the way to go.
This is not the answer to every piece of malformed HTML you may get, but most of the time JTidy does a good job of cleaning the HTML and gives you an interface for accessing the various DOM nodes, and with that, access to the text inside those nodes.

how to analyze an XML file? [closed]

I have a Java service that should receive an XML file containing several elements. I need to analyze this file, extract the matching elements, and send them to their related services.
I hope to find a lightweight way to do that, as it's a heavy XML file.
Does anyone know of a Java framework or solution which can help to perform that?
There are dozens of ways to read an XML file in Java, and several of them use only Java SE:
Google for them: you'll find documentation and tutorials. The API docs are also helpful. You'll find all these tools in the javax.xml packages.
XPath might be a good solution for this. Here is a link to the API documentation, but you'll also be able to find many tutorials and getting-started guides; just search Google for "Java XPath tutorial" or similar.,5.0/docs/api/javax/xml/xpath/package-summary.html
There are probably several other implementations of XPath that may be better suited to your needs, so it would be worth looking at other open-source implementations.
Pass it through an XSLT stylesheet. Java XSLT implementations are really good: they keep the input tree to a minimum and often use (at least by default) bytecode compilation instead of interpretation, making the transformation really fast. Matching input and producing the desired output is exactly what XSLT, with its declarative style, is suited for.
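A minimal sketch of that approach with the JDK's built-in XSLT engine (`javax.xml.transform`); the stylesheet, the `items`/`item` element names, and the `name` attribute are illustrative only:

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltMatcher {

    /** Applies an XSLT stylesheet to an XML string and returns the result. */
    public static String transform(String xml, String xslt) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Declaratively match every <item> element and emit only its name.
        String xslt = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:output method=\"text\"/>"
                + "<xsl:template match=\"/\">"
                + "<xsl:for-each select=\"//item\">"
                + "<xsl:value-of select=\"@name\"/><xsl:text>,</xsl:text>"
                + "</xsl:for-each>"
                + "</xsl:template></xsl:stylesheet>";
        String xml = "<items><item name=\"a\"/><item name=\"b\"/></items>";
        System.out.println(transform(xml, xslt)); // prints: a,b,
    }
}
```

The matching logic lives entirely in the stylesheet, so the "extract matched elements" part of the question can be changed without touching the Java code.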
You can build the DOM tree of the XML file and then walk it with a recursive method to find what you want.
Apache Commons Digester
Define your custom rules for reading the XML file, and invoke methods or create objects for unmarshalling:
/students/student -> new Student();
/students/student/name -> invoke setName(name) for student
/students/student -> invoke addStudent(student) for each student
Everything is optional; you define only what you need.
test with,
example in and example.
Use JDOM; it's small, fast, and easy to use.
Try VTD-XML; it is much faster than JDOM (up to 10x), DOM4J, or DOM, and also memory-efficient. See the VTD-XML Wikipedia entry for more info.